(2) RELATED APPEALS AND INTERFERENCES
Appellants, their legal representative and the assignee are not aware of any related 
appeals or interferences which will directly affect or be directly affected by or have a bearing on 
the Board's decision in the instant appeal. 



(3) STATUS OF THE CLAIMS
Claims rejected: Claims 23-31
Claims allowed: (none)
Claims canceled: Claims 1-22
Claims withdrawn: Claims 32-40
Claims on Appeal: Claims 23-31 (A copy of the claims on appeal, as amended, can be
found in the attached Appendix).



(4) STATUS OF AMENDMENTS AFTER FINAL 
The Amendment after Final Rejection under 37 C.F.R. §1.116 filed December 2, 
2003 has been entered for purposes of this appeal. 



(5) SUMMARY OF THE INVENTION 
Appellants' invention is directed to polynucleotides comprising the polynucleotide
sequence of SEQ ID NO:2, encoding the human myosin heavy chain homolog (MHCH), comprising
the amino acid sequence of SEQ ID NO:1 (Specification, e.g., at page 2, lines 28-30 and page 3,
lines 12-13). Appellants' invention also includes polynucleotides comprising a naturally
occurring polynucleotide sequence at least 70% identical to a polynucleotide sequence of SEQ ID
NO:2, and polynucleotides encoding a polypeptide comprising a naturally occurring amino acid
sequence at least 90% identical to an amino acid sequence of SEQ ID NO:1, said polypeptide
having ATPase activity (e.g., at page 2, lines 30-33; page 3, lines 14-16; page 17, lines 8-13; and
page 18, lines 18-33), polynucleotides encoding biologically active fragments of SEQ ID NO:1,
complementary polynucleotides (e.g., at page 3, lines 2-5 and 12-13; and page 7, lines 4-5),
recombinant polynucleotides encoding polypeptides comprising SEQ ID NO:1, variants, or
fragments thereof (e.g., at page 3, lines 19-22), host cells transformed with recombinant
polynucleotides, and methods of making polypeptides encoded by the claimed polynucleotides
(e.g., at page 3, lines 21-26; pages 22-25; and Example IX at pages 47-48; and Example XIII at
page 50).

MHCH has chemical and structural homology to Caenorhabditis elegans myosin I heavy
chain (GI 1279777; SEQ ID NO:3), and Helianthus annuus unconventional myosin I heavy chain
(GI 2444174; SEQ ID NO:4). For example:

MHCH is 612 amino acids in length and has 7 potential casein kinase II 
phosphorylation sites at residues S62, T146, T221, S280, S323, S390, and T546, 
and 6 potential protein kinase C phosphorylation sites at residues S19, S140, 
S303, T441, S555, and S563. The sequence from T383 through M387 of MHCH 
is 80% identical to the conserved sequence found at the end of myosin head 
domains. MHCH contains two possible light-chain binding sites. The first, from
I410 through E421, contains 4 out of 6 conserved residues and the second, from
I432 through K443, contains 5 out of 6 conserved residues. PFAM analysis
shows that MHCH shares homology with a myosin head domain from residue F51
to residue L314. PRINTS analysis shows that MHCH shares homology with a
myosin heavy chain signature motif from residue F51 to K79 and from residue
F105 to C133. MHCH has a possible transmembrane motif from residue W506 to
P535. As shown in Figures 2A, 2B, 2C, 2D, 2E, 2F, 2G, 2H, 2I, 2J and 2K,
MHCH has chemical and structural similarity with Caenorhabditis elegans myosin
I heavy chain (GI 1279777; SEQ ID NO:3), and Helianthus annuus unconventional
myosin heavy chain (GI 2444174; SEQ ID NO:4). MHCH and myosin I heavy
chain share 23.2% identity, and in particular they share 39% identity from residue 
F51 to residue L314. MHCH and unconventional myosin I share 22.4% identity, 
and in particular they share 38% identity from residue F51 to residue L314 of 
MHCH. (Specification, page 17, line 27 to page 18, line 9). 
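
For readers unfamiliar with how percent-identity figures such as those quoted above are typically derived, the following is a minimal sketch in Python. It assumes two sequences that have already been aligned (gaps written as "-") and counts identical residues over the aligned, non-gap positions. The function name and the toy sequences are hypothetical illustrations, not taken from the specification, and alignment tools differ in how they choose the denominator, so this is not necessarily the method used to obtain the figures above.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over aligned, non-gap positions.

    Assumption: both inputs come from the same pairwise alignment,
    so they have equal length and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    compared = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        # Skip columns where either sequence has a gap.
        if res_a == "-" or res_b == "-":
            continue
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy five-residue window (hypothetical sequences): 4 of 5 positions match.
print(percent_identity("ACDEF", "ACDEG"))
```

On a five-residue window, four matching positions give 80%, which is the arithmetic behind a statement of the form "80% identical" for a five-residue stretch.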

The polynucleotides of the present invention are useful, for example, for toxicology 
testing, drug discovery, and diagnosis, prevention, and treatment of heart and skeletal muscle 
disorders, developmental disorders, and cell proliferative disorders, including cancer. 



(6) ISSUES 

1. Whether claims 23-31 directed to polynucleotides encoding MHCH sequences
meet the utility requirement of 35 U.S.C. §101. 

2. Whether one of ordinary skill in the art would know how to use the claimed 
sequences, e.g., in toxicology testing, drug development, and the diagnosis of disease, so as to
satisfy the enablement requirement of 35 U.S.C. §112, first paragraph.

3. Whether one of ordinary skill in the art would know how to make the claimed 
polynucleotide variants and fragments according to claims 23, 26-28, and 30, so as to satisfy the 
enablement requirement of 35 U.S.C. §112, first paragraph. 

4. Whether claims 23, 26-28, 30, and 31 meet the written description requirement of 
35 U.S.C. §112, first paragraph. 

(7) GROUPING OF THE CLAIMS 

As to Issue 1 

All of the claims on appeal are grouped together. 
As to Issue 2 

All of the claims on appeal are grouped together. 
As to Issue 3 

All of the claims on appeal are grouped together. 
As to Issue 4 

All of the claims on appeal are grouped together. 

(8) APPELLANTS' ARGUMENTS 

Issue 1 - Whether the claims meet the utility requirement of 35 U.S.C. § 101 

The rejection of claims 23-31 is improper, as the inventions of those claims have a 
patentable utility as set forth in the instant specification, and/or a utility well known to one 
of ordinary skill in the art. 

The invention at issue is a polynucleotide sequence corresponding to a gene that is 
expressed in hematopoietic/immune system, gastrointestinal, musculoskeletal, and reproductive 
tissues, and in tissues associated with cancer (Specification at page 18, lines 12-17). In 
particular, similarities between SEQ ID NO:1 and C. elegans myosin (g1279777) and H. annuus
unconventional myosin (g2444174), including the presence of myosin head domain, myosin
heavy chain, and light chain binding site signatures, are described in the specification, for 
example, at page 17, line 30 through page 18, line 9. The specification points out the roles of 
myosin in muscle contraction, intracellular movement, phagocytosis, and cytokinesis, and 
describes various diseases associated with myosin dysfunction, including muscle disorders, 
cardiovascular disease, deafness, and cancer (Specification at pages 1-2). As such, the claimed 
invention has numerous practical, beneficial uses in toxicology testing, drug development, and 
the diagnosis of disease, none of which requires knowledge of how the polypeptide coded for by 
the polynucleotide actually functions. 

Appellants previously submitted the Declaration of Dr. Tod Bedilion describing some of 
the practical uses of the claimed invention in gene and protein expression monitoring 
applications. The Bedilion Declaration demonstrates that the positions and arguments made by 
the Patent Examiner with respect to the utility of the claimed polynucleotide are without merit. 

The Bedilion Declaration describes, in particular, how the claimed expressed
polynucleotide can be used in gene expression monitoring applications that were well-known at
the time the patent application was filed, and how those applications are useful in developing
drugs and monitoring their activity. Dr. Bedilion states that the claimed invention is a useful
tool when employed as a highly specific probe in a cDNA microarray:

Persons skilled in the art would appreciate that cDNA microarrays that contained the SEQ
ID NO:1-encoding polynucleotides would be a more useful tool than cDNA microarrays
that did not contain the polynucleotides in connection with conducting gene expression 
monitoring studies on proposed (or actual) drugs for treating heart and skeletal muscle 
disorders, developmental disorders, and cell proliferative disorders, including cancer for 
such purposes as evaluating their efficacy and toxicity. 

The Patent Examiner does not dispute that the claimed polynucleotide can be used as a 
probe in cDNA microarrays and used in gene expression monitoring applications. Instead, the 
Patent Examiner contends that the claimed polynucleotide cannot be useful without precise 
knowledge of its biological function. But the law never has required knowledge of biological 
function to prove utility. It is the claimed invention's uses, not its functions, that are the subject 
of a proper analysis under the utility requirement. 
In any event, as demonstrated by the Bedilion Declaration, the person of ordinary skill in
the art can achieve beneficial results from the claimed polynucleotide in the absence of any 
knowledge as to the precise function of the protein encoded by it. The uses of the claimed 
polynucleotide in gene expression monitoring applications are in fact independent of its precise 
function. 

I. The Applicable Legal Standard 

To meet the utility requirement of sections 101 and 112 of the Patent Act, the patent
applicant need only show that the claimed invention is "practically useful," Anderson v. Natta,
480 F.2d 1392, 1397, 178 USPQ 458 (CCPA 1973) and confers a "specific benefit" on the
public. Brenner v. Manson, 383 U.S. 519, 534-35, 148 USPQ 689 (1966). As discussed in a
recent Court of Appeals for the Federal Circuit case, this threshold is not high:

An invention is "useful" under section 101 if it is capable of providing some identifiable 
benefit. See Brenner v. Manson, 383 U.S. 519, 534 [148 USPQ 689] (1966); Brooktree 
Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571 [24 USPQ2d 1401] (Fed. 
Cir. 1992) ("to violate Section 101 the claimed device must be totally incapable of 
achieving a useful result"); Fuller v. Berger, 120 F. 274, 275 (7th Cir. 1903) (test for 
utility is whether invention "is incapable of serving any beneficial end"). 

Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700 (Fed. Cir. 1999). 

While an asserted utility must be described with specificity, the patent applicant need not
demonstrate utility to a certainty. In Stiftung v. Renishaw PLC, 945 F.2d 1173, 1180,
20 USPQ2d 1094 (Fed. Cir. 1991), the United States Court of Appeals for the Federal Circuit
explained:

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention has 
only limited utility and is only operable in certain applications is not grounds for finding 
lack of utility." Envirotech Corp. v. Al George, Inc., 730 F.2d 753, 762, 221 USPQ 473, 
480 (Fed. Cir. 1984). 

The specificity requirement is not, therefore, an onerous one. If the asserted utility is 
described so that a person of ordinary skill in the art would understand how to use the claimed 
invention, it is sufficiently specific. See Standard Oil Co. v. Montedison, S.p.a., 212 U.S.P.Q. 
327, 343 (3d Cir. 1981). The specificity requirement is met unless the asserted utility amounts to 
a "nebulous expression" such as "biological activity" or "biological properties" that does not 
convey meaningful information about the utility of what is being claimed. Cross v. Iizuka,
753 F.2d 1040, 1048 (Fed. Cir. 1985). 

In addition to conferring a specific benefit on the public, the benefit must also be 
"substantial." Brenner, 383 U.S. at 534. A "substantial" utility is a practical, "real-world" 
utility. Nelson v. Bowler, 626 F.2d 853, 856, 206 USPQ 881 (CCPA 1980). 

If persons of ordinary skill in the art would understand that there is a "well-established" 
utility for the claimed invention, the threshold is met automatically and the applicant need not 
make any showing to demonstrate utility. Manual of Patent Examining Procedure at
§ 706.03(a). Only if there is no "well-established" utility for the claimed invention must the 
applicant demonstrate the practical benefits of the invention. Id. 

Once the patent applicant identifies a specific utility, the claimed invention is presumed 
to possess it. In re Cortright, 165 F.3d 1353, 1357, 49 USPQ2d 1464 (Fed. Cir. 1999); In re 
Brana, 51 F.3d 1560, 1566; 34 USPQ2d 1436 (Fed. Cir. 1995). In that case, the Patent Office 
bears the burden of demonstrating that a person of ordinary skill in the art would reasonably 
doubt that the asserted utility could be achieved by the claimed invention. Id. To do so, the
Patent Office must provide evidence or sound scientific reasoning. See In re Langer, 503 F.2d
1380, 1391-92, 183 USPQ 288 (CCPA 1974). If and only if the Patent Office makes such a
showing, the burden shifts to the applicant to provide rebuttal evidence that would convince the 
person of ordinary skill that there is sufficient proof of utility. Brana, 51 F.3d at 1566. The 
applicant need only prove a "substantial likelihood" of utility; certainty is not required. Brenner, 
383 U.S. at 532. 

II. Use of the claimed polynucleotide for diagnosis of conditions or diseases
characterized by expression of MHCH, for toxicology testing, and for drug
discovery are sufficient utilities under 35 U.S.C. §§ 101 and 112, first paragraph

The claimed invention meets all of the necessary requirements for establishing a credible 
utility under the Patent Law: There are "well-established" uses for the claimed invention known 
to persons of ordinary skill in the art, and there are specific practical and beneficial uses for the 
invention disclosed in the patent application's specification. These uses are explained, in detail, 
in the Bedilion Declaration previously submitted on July 7, 2003. Objective evidence, not
considered by the Patent Office, further corroborates the credibility of the asserted utilities. 

A. The use of MHCH for toxicology testing, drug discovery, and disease 
diagnosis are practical uses that confer "specific benefits" to the public 

The claimed invention has specific, substantial, real-world utility by virtue of its use in 
toxicology testing, drug development and disease diagnosis through gene expression profiling. 
These uses are explained in detail in the accompanying Bedilion Declaration, the substance of 
which is not rebutted by the Patent Examiner. There is no dispute that the claimed invention is in 
fact a useful tool in cDNA microarrays used to perform gene expression analysis. That is 
sufficient to establish utility for the claimed polynucleotide. 

In his Declaration, Dr. Bedilion explains the many reasons why a person skilled in the art 
reading the Tang '248 application on November 5, 1998 would have understood that application 
to disclose the claimed polynucleotide to be useful for a number of gene expression monitoring 
applications, e.g., as a highly specific probe for the expression of that specific polynucleotide in 
connection with the development of drugs and the monitoring of the activity of such drugs. 
(Bedilion Declaration at, e.g., ¶¶ 10-15). Much, but not all, of Dr. Bedilion's explanation
concerns the use of the claimed polynucleotide in cDNA microarrays of the type first developed
at Stanford University for evaluating the efficacy and toxicity of drugs, as well as for other
applications. (Bedilion Declaration, ¶¶ 12 and 15). 1

In connection with his explanations, Dr. Bedilion states that the "Tang '248 specification
would have led a person skilled in the art on November 5, 1998 who was using gene expression
monitoring in connection with working on developing new drugs for the treatment of heart and
skeletal muscle disorders, developmental disorders, and cell proliferative disorders, including
cancer [a] to conclude that a cDNA microarray that contained the SEQ ID NO:1-encoding
polynucleotides would be a highly useful tool, and [b] to request specifically that any cDNA
microarray that was being used for such purposes contain the SEQ ID NO:1-encoding
polynucleotides" (Bedilion Declaration, ¶ 15). For example, as explained by Dr. Bedilion,
"[p]ersons skilled in the art would [have appreciated on November 5, 1998] that a cDNA
microarray that contained the SEQ ID NO:1-encoding polynucleotides would be a more useful
tool than a cDNA microarray that did not contain the polynucleotides in connection with
conducting gene expression monitoring studies on proposed (or actual) drugs for treating heart
and skeletal muscle disorders, developmental disorders, and cell proliferative disorders,
including cancer for such purposes as evaluating their efficacy and toxicity." Id.

1 Dr. Bedilion also explained, for example, why persons skilled in the art would also
appreciate, based on the Tang '248 specification, that the claimed polynucleotide would be useful
in connection with developing new drugs using technology, such as Northern analysis, that
predated by many years the development of the cDNA technology (Bedilion Declaration, ¶ 16).

In support of those statements, Dr. Bedilion provided detailed explanations of how cDNA 
technology can be used to conduct gene expression monitoring evaluations, with extensive 
citations to pre-November 5, 1998 publications showing the state of the art on November 5, 
1998. (Bedilion Declaration, ¶¶ 10-14). While Dr. Bedilion's explanations in paragraph 15 of
his Declaration include almost three pages of text and six subparts (a)-(f), he specifically states 
that his explanations are not "all-inclusive." Id. For example, with respect to toxicity 
evaluations, Dr. Bedilion had earlier explained how persons skilled in the art who were working 
on drug development on November 5, 1998 (and for several years prior to November 5, 1998) 
"without any doubt" appreciated that the toxicity (or lack of toxicity) of any proposed drug was 
"one of the most important criteria to be evaluated in connection with the development of the 
drug" and how the teachings of the Tang '248 application clearly include using differential gene 
expression analyses in toxicity studies (Bedilion Declaration, ¶ 10).

Thus, the Bedilion Declaration establishes that persons skilled in the art reading the Tang 
'248 application at the time it was filed "would have wanted their cDNA microarray to have a 
[SEQ ID NO:1-encoding polynucleotide probe] because a microarray that contained such a probe
(as compared to one that did not) would provide more useful results in the kind of gene
expression monitoring studies using cDNA microarrays that persons skilled in the art have been
doing since well prior to November 5, 1998" (Bedilion Declaration, ¶ 15, item (f)). This, by
itself, provides more than sufficient reason to compel the conclusion that the Tang '248 
application disclosed to persons skilled in the art at the time of its filing substantial, specific and 
credible real-world utilities for the claimed polynucleotide. 

Nowhere does the Patent Examiner address the fact that, as described on pp. 31-32 of the 
Tang '248 application, the claimed polynucleotides can be used as highly specific probes in, for 
example, cDNA microarrays - probes that without question can be used to measure both the
existence and amount of complementary RNA sequences known to be the expression products of 
the claimed polynucleotides. The claimed invention is not, in that regard, some random sequence 
whose value as a probe is speculative or would require further research to determine. 

Given the fact that the claimed polynucleotide is known to be expressed, its utility as a 
measuring and analyzing instrument for expression levels is as indisputable as a scale's utility for 
measuring weight. This use as a measuring tool, regardless of how the expression level data 
ultimately would be used by a person of ordinary skill in the art, by itself demonstrates that the 
claimed invention provides an identifiable, real-world benefit that meets the utility requirement. 
Raytheon v. Roper, 724 F.2d 951 (Fed. Cir. 1983) (claimed invention need only meet one of its
stated objectives to be useful); In re Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999) (how the
invention works is irrelevant to utility); MPEP § 2107 ("Many research tools such as gas
chromatographs, screening assays, and nucleotide sequencing techniques have a clear, specific,
and unquestionable utility (e.g., they are useful in analyzing compounds)" (emphasis added)).

Though Appellants need not so prove to demonstrate utility, there can be no reasonable 
dispute that persons of ordinary skill in the art have numerous uses for information about relative 
gene expression including, for example, understanding the effects of a potential drug for treating 
heart and skeletal muscle disorders, developmental disorders, and cell proliferative disorders, 
including cancer. Because the patent application states explicitly that the claimed polynucleotide 
is known to be expressed in hematopoietic/immune system, gastrointestinal, musculoskeletal, 
and reproductive tissues, and in tissues associated with cancer (Specification at page 18, lines 12- 
17), and expresses a protein that is a member of the myosin family known to be associated with 
diseases such as heart and skeletal muscle disorders, developmental disorders, and cell 
proliferative disorders, including cancer, there can be no reasonable dispute that a person of 
ordinary skill in the art could put the claimed invention to such use. In other words, the person of
ordinary skill in the art can derive more information about a potential toxin, or about a potential
drug candidate for treating heart and skeletal muscle disorders, developmental disorders, and cell
proliferative disorders, including cancer, with the claimed invention than without it (see
Bedilion Declaration at, e.g., ¶ 15, subparts (e)-(f)).
The Bedilion Declaration shows that a number of pre-November 5, 1998 publications
confirm and further establish the utility of cDNA microarrays in a wide range of drug
development gene expression monitoring applications at the time the Tang '248 application was
filed (Bedilion Declaration ¶¶ 10-14; Bedilion Exhibits A-G). Indeed, Brown and Shalon U.S.
Patent No. 5,807,522 (the Brown '522 patent, Bedilion Exhibit D), which issued from a patent
application filed in June 1995 and was effectively published on December 29, 1995 as a result of
the publication of a PCT counterpart application, shows that the Patent Office recognizes the
patentable utility of the cDNA technology developed in the early to mid-1990s. As explained by
Dr. Bedilion, among other things (Bedilion Declaration, ¶ 12):

The Brown '522 patent further teaches that the "[m]icroarrays of immobilized 
nucleic acid sequences prepared in accordance with the invention" can be used in 
"numerous" genetic applications, including "monitoring of gene expression" 
applications (see Bedilion Tab D at col. 14, lines 36-42). The Brown '522 patent 
teaches (a) monitoring gene expression (i) in different tissue types, (ii) in different 
disease states, and (iii) in response to different drugs, and (b) that arrays disclosed 
therein may be used in toxicology studies (see Bedilion Tab D at col. 15, lines 13- 
18 and 52-58 and col. 18, lines 25-30). 

Literature reviews published shortly after the filing of the Tang '248 application
describing the state of the art further confirm the claimed invention's utility. Rockett et al.
confirm, for example, that the claimed invention is useful for differential expression analysis
regardless of how expression is regulated:

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular 
analysis, recognition of the importance of differential gene expression and 
characterization of differentially expressed genes has existed for many years. 

* * * 

Although differential expression technologies are applicable to a broad range of 
models, perhaps their most important advantage is that, in most cases, absolutely 
no prior knowledge of the specific genes which are up- or down-regulated is 
required. 

* * *

Whereas it would be informative to know the identity and functionality of all 
genes up/down regulated by . . . toxicants, this would appear a longer term goal 
.... However, the current use of gene profiling yields a pattern of gene changes
for a xenobiotic of unknown toxicity which may be matched to that of well 
characterized toxins, thus alerting the toxicologist to possible in vivo similarities 
between the unknown and the standard, thereby providing a platform for more 
extensive toxicological examination. (emphasis added)

Rockett et al., Differential gene expression in drug metabolism and toxicology: practicalities,
problems and potential, 29 Xenobiotica No. 7, 655 (1999).

In another pre-November 5, 1998 article, Lashkari et al. state explicitly that sequences
that are merely "predicted" to be expressed (predicted Open Reading Frames, or ORFs) - the
claimed invention in fact is known to be expressed - have numerous uses:

Efforts have been directed toward the amplification of each predicted ORF or any 
other region of the genome ranging from a few base pairs to several kilobase 
pairs. There are many uses for these amplicons- they can be cloned into standard 
vectors or specialized expression vectors, or can be cloned into other specialized 
vectors such as those used for two-hybrid analysis. The amplicons can also be 
used directly by, for example, arraying onto glass for expression analysis, for
DNA binding assays, or for any direct DNA assay. 

Lashkari et al., Whole genome analysis: Experimental access to all genome sequenced segments
through larger-scale efficient oligonucleotide synthesis and PCR, 94 Proc. Nat. Acad. Sci. 8945
(Aug. 1997) (emphasis added).

B. The use of nucleic acids coding for proteins expressed by humans as tools for
toxicology testing, drug discovery, and the diagnosis of disease is now
"well-established"

The technologies made possible by expression profiling and the DNA tools upon which
they rely are now well-established. The technical literature recognizes not only the prevalence of
these technologies, but also their unprecedented advantages in drug development, testing and 
safety assessment. These technologies include toxicology testing, as described by Bedilion in his 
Declaration. 

Toxicology testing is now standard practice in the pharmaceutical industry. See, e.g.,
John C. Rockett et al., supra:

Knowledge of toxin-dependent regulation in target tissues is not solely an academic 
pursuit as much interest has been generated in the pharmaceutical industry to harness this 
technology in the early identification of toxic drug candidates, thereby shortening the 
developmental process and contributing substantially to the safety assessment of new
drugs. 

To the same effect are several other scientific publications, including Emile F. Nuwaysir et al.,
Microarrays and Toxicology: The Advent of Toxicogenomics, 24 Molecular Carcinogenesis 153
(1999); Sandra Steiner and N. Leigh Anderson, Expression profiling in toxicology - potentials
and limitations, 112-13 Toxicology Letters 467 (2000).

Nucleic acids useful for measuring the expression of whole classes of genes are routinely
incorporated for use in toxicology testing. Nuwaysir et al. describes, for example, a Human
ToxChip comprising 2089 human clones, which were selected

for their well-documented involvement in basic cellular processes as well as their 
responses to different types of toxic insult. Included on this list are DNA replication and 
repair genes, apoptosis genes, and genes responsive to PAHs and dioxin-like compounds, 
peroxisome proliferators, estrogenic compounds, and oxidant stress. Some of the other 
categories of genes include transcription factors, oncogenes, tumor suppressor genes, 
cyclins, kinases, phosphatases, cell adhesion and motility genes, and homeobox genes. 
Also included in this group are 84 housekeeping genes, whose hybridization intensity is 
averaged and used for signal normalization of the other genes on the chip. 

See also Table 1 of Nuwaysir et al. (listing additional classes of genes deemed to be of special
interest in making a human toxicology microarray).
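
The signal normalization mentioned in the passage quoted from Nuwaysir et al. (averaging the hybridization intensity of housekeeping genes and using that average to normalize the other genes on the chip) can be illustrated with a minimal sketch in Python. The gene names, intensity values, and the function name below are hypothetical, and dividing each intensity by the housekeeping mean is one simple reading of "used for signal normalization", not necessarily the exact procedure the authors followed.

```python
def normalize_by_housekeeping(
    intensities: dict[str, float], housekeeping: set[str]
) -> dict[str, float]:
    """Scale every spot intensity by the mean housekeeping-gene intensity.

    Assumption: 'intensities' maps a gene identifier to its raw
    hybridization intensity on a single array.
    """
    hk_values = [intensities[g] for g in housekeeping if g in intensities]
    if not hk_values:
        raise ValueError("no housekeeping genes found on this array")

    hk_mean = sum(hk_values) / len(hk_values)
    if hk_mean == 0:
        raise ValueError("housekeeping mean intensity is zero")

    # A normalized value of 1.0 means "same signal as the average control".
    return {gene: value / hk_mean for gene, value in intensities.items()}


# Hypothetical raw intensities for one array.
raw = {"HK1": 100.0, "HK2": 300.0, "GENE_X": 400.0}
normalized = normalize_by_housekeeping(raw, {"HK1", "HK2"})
print(normalized["GENE_X"])
```

Expressing each signal relative to the control average makes values from different arrays comparable, which is why the stability of the control genes across experiments matters.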

The more genes that are available for use in toxicology testing, the more powerful the 
technique. "Arrays are at their most powerful when they contain the entire genome of the species 
they are being used to study." John C. Rockett and David J. Dix, Application of DNA Arrays to
Toxicology, 107 Environ. Health Perspec. 681, No. 8 (1999). Control genes are carefully selected
for their stability across a large set of array experiments in order to best study the effect of 
toxicological compounds. See attached email from the primary investigator on the Nuwaysir 
paper, Dr. Cynthia Afshari, to an Incyte employee, dated July 3, 2000, as well as the original 
message to which she was responding, indicating that even the expression of carefully selected 
control genes can be altered. Thus, there is no expressed gene which is irrelevant to screening 
for toxicological effects, and all expressed genes have a utility for toxicological screening. 

In fact, the potential benefits to the public, in terms of lives saved and reduced health care
costs, are enormous. Recent developments provide evidence that the benefits of this information
are already beginning to manifest themselves. Examples include the following: 
• In 1999, CV Therapeutics, an Incyte collaborator, was able to use Incyte gene
expression technology, information about the structure of a known transporter
gene, and chromosomal mapping location, to identify the key gene associated with
Tangier disease. This discovery took place over a matter of only a few weeks,
due to the power of these new genomics technologies. The discovery received an 
award from the American Heart Association as one of the top 10 discoveries 
associated with heart disease research in 1999. 

• In an April 9, 2000, article published by the Bloomberg news service, an Incyte 
customer stated that it had reduced the time associated with target discovery and 
validation from 36 months to 18 months, through use of Incyte's genomic
information database. Other Incyte customers have privately reported similar 
experiences. The implications of this significant saving of time and expense for 
the number of drugs that may be developed and their cost are obvious. 

• In a February 10, 2000, article in the Wall Street Journal, one Incyte customer 
stated that over 50 percent of the drug targets in its current pipeline were derived 
from the Incyte database. Other Incyte customers have privately reported similar 
experiences. By doubling the number of targets available to pharmaceutical 
researchers, Incyte genomic information has demonstrably accelerated the 
development of new drugs. 

Because the Patent Examiner failed to address or consider the "well-established" utilities 
for the claimed invention in toxicology testing, drug development, and the diagnosis of disease, 
the Examiner's rejections should be overturned regardless of their merit. 

C. The Uncontested Fact That the Claimed Polynucleotide Encodes for a 
Protein in the Myosin Family Also Demonstrates Utility 

In addition to having substantial, specific and credible utilities in numerous gene 
expression monitoring applications, it is undisputed that the claimed polynucleotide encodes for 
a protein having the sequence shown as SEQ ID NO:1 in the patent application and referred to as 
MHCH in that application. Appellants have demonstrated that MHCH is a member of the 
myosin family, and that the myosin family of proteins includes motor proteins that are involved 
in muscle contraction, intracellular movement, phagocytosis, and cytokinesis. 

The Patent Examiner does not dispute any of the facts set forth in the previous paragraph. 
Neither does the Patent Examiner dispute that, if a polynucleotide encodes for a protein that has a 
substantial, specific and credible utility, then it follows that the polynucleotide also has a 
substantial, specific and credible utility. 

The Examiner must accept Appellants' demonstration that the polypeptide encoded by 
the claimed invention is a member of the myosin family and that utility is proven by a reasonable 
probability unless the Examiner can demonstrate through evidence or sound scientific reasoning 
that a person of ordinary skill in the art would doubt utility. See In re Langer, 503 F.2d 1380, 
1391-92, 183 USPQ 288 (CCPA 1974). The Examiner has not provided sufficient evidence or 
sound scientific reasoning to the contrary. 

Nor has the Examiner provided any evidence that any member of the myosin family, let 
alone a substantial number of those members, is not useful. In such circumstances, the only 
reasonable inference is that the polypeptide encoded by the claimed invention must be useful, 
like the other members of the myosin family. 

D. Objective evidence corroborates the utilities of the claimed invention 

There is, in fact, no restriction on the kinds of evidence a Patent Examiner may consider 
in determining whether a "real-world" utility exists. Indeed, "real-world" evidence, such as 
evidence showing actual use or commercial success of the invention, can demonstrate conclusive 
proof of utility. Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); Nestle v. Eugene, 55 
F.2d 854, 856, 12 USPQ 335 (6th Cir. 1932). Indeed, proof that the invention is made, used or 
sold by any person or entity other than the patentee is conclusive proof of utility. United States 
Steel Corp. v. Phillips Petroleum Co., 865 F.2d 1247, 1252, 9 USPQ2d 1461 (Fed. Cir. 1989). 

Over the past several years, a vibrant market has developed for databases containing all 
expressed genes (along with the polypeptide translations of those genes), in particular genes 
having medical and pharmaceutical significance such as the instant sequence. (Note that the 
value in these databases is enhanced by their completeness, but each sequence in them is 
independently valuable.) The databases sold by Appellants' assignee, Incyte, include exactly the 
kinds of information made possible by the claimed invention, such as tissue and disease 
associations. Incyte sells its database containing the claimed sequence and millions of other 
sequences throughout the scientific community, including to pharmaceutical companies who use 
the information to develop new pharmaceuticals. 



118391 15 09/830,914 

Docket No.: PF-0621 USN 

Both Incyte's customers and the scientific community have acknowledged that Incyte's 
databases have proven to be valuable in, for example, the identification and development of drug 
candidates. As Incyte adds information to its databases, including the information that can be 
generated only as a result of Incyte's discovery of the claimed polynucleotide and its use of that 
polynucleotide on cDNA microarrays, the databases become even more powerful tools. Thus the 
claimed invention adds more than incremental benefit to the drug discovery and development 
process. 

III. The Patent Examiner's Rejections Are Without Merit 

Rather than responding to the evidence demonstrating utility, the Examiner attempts to 
dismiss it altogether by arguing that the disclosed and well-established utilities for the claimed 
polynucleotide are not "specific or substantial" utilities. (Office Action at p. 4). The Examiner is 
incorrect both as a matter of law and as a matter of fact. 

A. The Precise Biological Role Or Function Of An Expressed Polynucleotide Is 
Not Required To Demonstrate Utility 

The Patent Examiner's primary rejection of the claimed invention is based on the ground 
that, without information as to the precise "biological role" of the claimed invention, the claimed 
invention's utility is not sufficiently specific. According to the Examiner, it is not enough that a 
person of ordinary skill in the art could use and, in fact, would want to use the claimed invention 
either by itself or in a cDNA microarray to monitor the expression of genes for such applications 
as the evaluation of a drug's efficacy and toxicity. The Examiner would require, in addition, that 
Appellants provide a specific and substantial interpretation of the results generated in any 
given expression analysis. 

It may be that specific and substantial interpretations and detailed information on 
biological function are necessary to satisfy the requirements for publication in some technical 
journals, but they are not necessary to satisfy the requirements for obtaining a United States 
patent. The relevant question is not, as the Examiner would have it, whether it is known how or 
why the invention works, In re Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999), but rather 
whether the invention provides an "identifiable benefit" in presently available form. Juicy Whip 
Inc. v. Orange Bang Inc., 185 F.3d 1364, 1366 (Fed. Cir. 1999). If the benefit exists, and there is 
a substantial likelihood the invention provides the benefit, it is useful. There can be no doubt, 
particularly in view of the Bedilion Declaration (at, e.g., ¶¶ 10 and 15), that the present 
invention meets this test. 

The threshold for determining whether an invention produces an identifiable benefit is 
low. Juicy Whip, 185 F.3d at 1366. Only those utilities that are so nebulous that a person of 
ordinary skill in the art would not know how to achieve an identifiable benefit and, at least 
according to the PTO guidelines, so-called "throwaway" utilities that are not directed to a person 
of ordinary skill in the art at all, do not meet the statutory requirement of utility. Utility 
Examination Guidelines, 66 Fed. Reg. 1092 (Jan. 5, 2001). 

Knowledge of the biological function or role of a biological molecule has never been 
required to show real-world benefit. In its most recent explanation of its own utility guidelines, 
the PTO acknowledged as much (66 F.R. at 1095): 

[T]he utility of a claimed DNA does not necessarily depend on the function of the 
encoded gene product. A claimed DNA may have specific and substantial utility 
because, e.g., it hybridizes near a disease-associated gene or it has gene-regulating 
activity. 

By implicitly requiring knowledge of biological function for any claimed nucleic acid, the 
Examiner has, contrary to law, elevated what is at most an evidentiary factor into an absolute 
requirement of utility. Rather than looking to the biological role or function of the claimed 
invention, the Examiner should have looked first to the benefits it is alleged to provide. 

B. Membership in a Class of Useful Products Can Be Proof of Utility 

Despite the uncontradicted evidence that the claimed polynucleotide encodes a 
polypeptide in the myosin family, the Examiner refused to impute the utility of the members of 
the myosin family to MHCH. In the Office Action, the Patent Examiner takes the position that, 
unless Appellants can identify which particular biological function within the class of myosins is 
possessed by MHCH, utility cannot be imputed. To demonstrate utility by membership in the 
class of myosins, the Examiner would require that all myosins possess a "common" utility. 

There is no such requirement in the law. In order to demonstrate utility by membership in 
a class, the law requires only that the class not contain a substantial number of useless members. 
So long as the class does not contain a substantial number of useless members, there is sufficient 
likelihood that the claimed invention will have utility, and a rejection under 35 U.S.C. § 101 is 
improper. That is true regardless of how the claimed invention ultimately is used and whether or 
not the members of the class possess one utility or many. See Brenner v. Manson, 383 U.S. 519, 
532 (1966); Application of Kirk, 376 F.2d 936, 943 (CCPA 1967). 

Membership in a "general" class is insufficient to demonstrate utility only if the class 
contains a sufficient number of useless members such that a person of ordinary skill in the art 
could not impute utility by a substantial likelihood. There would be, in that case, a substantial 
likelihood that the claimed invention is one of the useless members of the class. In the few cases 
in which class membership did not prove utility by substantial likelihood, the classes did in fact 
include predominately useless members. E.g., Brenner (man-made steroids); Kirk (same); Natta 
(man-made polyethylene polymers). 

The Examiner addresses MHCH as if the general class in which it is included is not the 
myosin family, but rather all polynucleotides or all polypeptides, including the vast majority of 
useless theoretical molecules not occurring in nature, and thus not pre-selected by nature to be 
useful. While these "general classes" may contain a substantial number of useless members, the 
myosin family does not. The myosin family is sufficiently specific to rule out any reasonable 
possibility that MHCH would not also be useful like the other members of the family. 

Because the Examiner has not presented any evidence that the myosin class of proteins 
has any, let alone a substantial number, of useless members, the Examiner must conclude that 
there is a "substantial likelihood" that the MHCH encoded by the claimed polynucleotide is 
useful. It follows that the claimed polynucleotide also is useful. 

It is undisputed that known members of the myosin family are motor proteins involved in 
muscle contraction, intracellular movement, phagocytosis, and cytokinesis. A person of ordinary 
skill in the art need not know any more about how the claimed invention functions to use it, and 
the Examiner presents no evidence to the contrary. The Examiner then goes on to assume that 
the only use for MHCH absent knowledge as to how the myosin actually works is further study 
of MHCH itself. 

Not so. As demonstrated by Appellants, knowledge that MHCH is a myosin is more than 
sufficient to make it useful for the diagnosis and treatment of heart and skeletal muscle disorders, 
developmental disorders, and cell proliferative disorders, including cancer. Indeed, MHCH has 
been shown to be expressed in hematopoietic/immune system, gastrointestinal, musculoskeletal, 
and reproductive tissues, and in tissues associated with cancer (Specification at page 18, lines 12- 
17). The Examiner must accept these facts to be true unless the Examiner can provide evidence 
or sound scientific reasoning to the contrary. But the Examiner has not done so. 

C. Because the uses of polynucleotides encoding MHCH in toxicology testing, 
drug discovery, and disease diagnosis are practical uses beyond mere study 
of the invention itself, the claimed invention has substantial utility. 

The PTO rejected the claims at issue on the ground that the use of an invention as a tool 
for research is not a "substantial" use. Because the PTO's rejection assumes a substantial 
overstatement of the law, and is incorrect in fact, it must be overturned. 

There is no authority for the proposition that use as a tool for research is not a substantial 
utility. Indeed, the Patent Office has recognized that just because an invention is used in a 
research setting does not mean that it lacks utility (MPEP § 2107): 

Many research tools such as gas chromatographs, screening assays, and nucleotide 
sequencing techniques have a clear, specific and unquestionable utility (e.g., they are 
useful in analyzing compounds). An assessment that focuses on whether an invention is 
useful only in a research setting thus does not address whether the specific invention is in 
fact "useful" in a patent sense. Instead, Office personnel must distinguish between 
inventions that have a specifically identified utility and inventions whose specific utility 
requires further research to identify or reasonably confirm. 

The Patent Office's actual practice has been, at least until the present, consistent with that 
approach. It has routinely issued patents for inventions whose only use is to facilitate research, 
such as DNA ligases, which the PTO's Training Materials themselves acknowledge to be 
useful, as well as DNA sequences used, for example, as markers. 

Only a limited subset of research uses are not "substantial" utilities: those in which the 
only known use for the claimed invention is to be an object of further study, thus merely inviting 
further research. This follows from Brenner, in which the U.S. Supreme Court held that a 
process for making a compound does not confer a substantial benefit where the only known use 
of the compound was to be the object of further research to determine its use. Id. at 535. 
Similarly, in Kirk, the Court held that a compound would not confer substantial benefit on the 
public merely because it might be used to synthesize some other, unknown compound that would 
confer substantial benefit. Kirk, 376 F.2d at 940, 945 ("What Appellants are really saying to 
those in the art is take these steroids, experiment, and find what use they do have as medicines."). 
Nowhere do those cases state or imply, however, that a material cannot be patentable if it has 
some other beneficial use in research. 

As used in toxicology testing, drug discovery, and disease diagnosis, the claimed 
invention has a beneficial use in research other than studying the claimed invention or its protein 
products. It is a tool, rather than an object, of research. The data generated in gene expression 
monitoring using the claimed invention as a tool is not used merely to study the claimed 
polynucleotide itself, but rather to study properties of tissues, cells, and potential drug candidates 
and toxins. Without the claimed invention, the information regarding the properties of tissues, 
cells, drug candidates and toxins is less complete. (Bedilion Declaration at ¶ 15.) 

The claimed invention has numerous additional uses as a research tool, each of which 
alone is a "substantial utility." These include uses such as diagnostic assays (e.g., pages 36-39), 
chromosomal markers (e.g., pages 39-40), and ligand screening assays (e.g., page 40). 

IV. By Requiring the Patent Appellant to Assert a Particular or Unique Utility, the 
Patent Examination Utility Guidelines and Training Materials Applied by the 
Patent Examiner Misstate the Law 

There is an additional, independent reason to overturn the rejections: to the extent the 
rejections are based on Revised Interim Utility Examination Guidelines (64 FR 71427, 
December 21, 1999), the final Utility Examination Guidelines (66 FR 1092, January 5, 2001) 
and/or the Revised Interim Utility Guidelines Training Materials (USPTO Website 
www.uspto.gov, March 1, 2000), the Guidelines and Training Materials are themselves 
inconsistent with the law. 

The Training Materials, which direct the Examiners regarding how to apply the Utility 
Guidelines, address the issue of specificity with reference to two kinds of asserted utilities: 
"specific" utilities which meet the statutory requirements, and "general" utilities which do not. 
The Training Materials define a "specific utility" as follows: 

A [specific utility] is specific to the subject matter claimed. This contrasts to general 
utility that would be applicable to the broad class of invention. For example, a claim to a 
polynucleotide whose use is disclosed simply as "gene probe" or "chromosome marker" 
would not be considered to be specific in the absence of a disclosure of a specific DNA 
target. Similarly, a general statement of diagnostic utility, such as diagnosing an 
unspecified disease, would ordinarily be insufficient absent a disclosure of what condition 
can be diagnosed. 

The Training Materials distinguish between "specific" and "general" utilities by assessing 
whether the asserted utility is sufficiently "particular," i.e., unique (Training Materials at p.52) as 
compared to the "broad class of invention." (In this regard, the Training Materials appear to 
parallel the view set forth in Stephen G. Kunin, Written Description Guidelines and Utility 
Guidelines, 82 J.P.T.O.S. 77, 97 (Feb. 2000) ("With regard to the issue of specific utility the 
question to ask is whether or not a utility set forth in the specification is particular to the claimed 
invention.")). 

Such "unique" or "particular" utilities never have been required by the law. To meet the 
utility requirement, the invention need only be "practically useful," Natta, 480 F.2d at 1397, 
and confer a "specific benefit" on the public. Brenner, 383 U.S. at 534. Thus, incredible 
"throwaway" utilities, such as trying to "patent a transgenic mouse by saying it makes great snake 
food," do not meet this standard. Karen Hall, Genomic Warfare, The American Lawyer 68 (June 2000) 
(quoting John Doll, Chief of the Biotech Section of USPTO). 

This does not preclude, however, a general utility, contrary to the statement in the 
Training Materials where "specific utility" is defined (page 5). Practical real-world uses are not 
limited to uses that are unique to an invention. The law requires that the practical utility be 
"definite," not particular. Montedison, 664 F.2d at 375. Appellants are not aware of any court 
that has rejected an assertion of utility on the grounds that it is not "particular" or "unique" to 
the specific invention. Where 
courts have found utility to be too "general," it has been in those cases in which the asserted 
utility in the patent disclosure was not a practical use that conferred a specific benefit. That is, a 
person of ordinary skill in the art would have been left to guess as to how to benefit at all from 
the invention. In Kirk, for example, the CCPA held the assertion that a man-made steroid had 
"useful biological activity" was insufficient where there was no information in the specification 
as to how that biological activity could be practically used. Kirk, 376 F.2d at 941. 

The fact that an invention can have a particular use does not provide a basis for requiring 
a particular use. See Brana, supra (disclosure describing a claimed antitumor compound as 
being homologous to an antitumor compound having activity against a "particular" type of cancer 
was determined to satisfy the specificity requirement). "Particularity" is not and never has been 
the sine qua non of utility; it is, at most, one of many factors to be considered. 

As described supra, broad classes of inventions can satisfy the utility requirement so long 
as a person of ordinary skill in the art would understand how to achieve a practical benefit from 
knowledge of the class. Only classes that encompass a significant portion of nonuseful members 
would fail to meet the utility requirement. Supra § II.B.2 (Montedison, 664 F.2d at 374-75). 

The Training Materials fail to distinguish between broad classes that convey information 
of practical utility and those that do not, lumping all of them into the latter, unpatentable category 
of "general" utilities. As a result, the Training Materials paint with too broad a brush. 
Rigorously applied, they would render unpatentable whole categories of inventions that heretofore have 
been considered to be patentable and that have indisputably benefitted the public, including the 
claimed invention. See supra § II.B. Thus the Training Materials cannot be applied consistently 
with the law. 

Issue 2 - Whether claims 23-31 meet the enablement requirement of 35 U.S.C. § 112, first 
paragraph 

To the extent the rejection of the claimed invention under 35 U.S.C. § 112, first 
paragraph, is based on the improper rejection for lack of utility under 35 U.S.C. § 101, it 
must be reversed. 

The rejection set forth in the Office Action is based on the assertions discussed above, 
i.e., that the claimed invention lacks patentable utility. To the extent that the rejection under 
§ 112, first paragraph, is based on the improper allegation of lack of patentable utility under 
§ 101, it fails for the same reasons. 

Issue 3 - Whether one of ordinary skill in the art would know how to make the claimed 
polynucleotide variants and fragments according to claims 23, 26-28, and 30, so as to satisfy 
the enablement requirement of 35 U.S.C. § 112, first paragraph 

The rejection of claims 23, 26, 27, 28, and 30 is improper because the claimed 
variants and fragments of SEQ ID NO:1 are amply enabled by the disclosure of the 
specification. 

In particular, the Examiner alleges that "searching for the specific nucleotides to change 
(deletion, insertion, substitution, or combinations thereof) in a polynucleotide to make any 
polynucleotide of any nucleotide sequence having 70% identity to any polynucleotide encoding 
SEQ ID NO: 1 or any fragment thereof or any polynucleotide having 70% identity to SEQ ID 
NO:2 or any fragment thereof is well outside the realm of routine experimentation and 
predictability in the art..." (Office Action of April 4, 2003, page 5). 

As set forth in In re Marzocchi, 169 USPQ 367, 369 (CCPA 1971): 

The first paragraph of § 112 requires nothing more than objective enablement. 
How such a teaching is set forth, either by the use of illustrative examples or by 
broad terminology, is of no importance. 

As a matter of Patent Office practice, then, a specification disclosure which 
contains a teaching of the manner and process of making and using the invention 
in terms which correspond in scope to those used in describing and defining the 
subject matter sought to be patented must be taken as in compliance with the 
enabling requirement of the first paragraph of § 112 unless there is reason to 
doubt the objective truth of the statements contained therein which must be relied 
on for enabling support. 

Appellants respectfully point out that the claims of the instant application are drawn to 
naturally-occurring variants. Thus it is not necessary to screen every conceivable variant which 
might be made using recombinant methods, as all that is claimed are those variant sequences 
which are found in nature. Given the sequences of SEQ ID NO:1 and SEQ ID NO:2, one of 
ordinary skill in the art could readily identify a polynucleotide encoding a polypeptide 
comprising a naturally occurring amino acid sequence at least 90% identical to an amino acid 
sequence of SEQ ID NO:1 or a polynucleotide comprising a naturally occurring polynucleotide 
sequence at least 70% identical to a polynucleotide sequence of SEQ ID NO:2, using well known 
methods of sequence analysis without any undue experimentation. For example, the 
identification of relevant polynucleotides could be performed by hybridization and/or PCR 
techniques that were well-known to those skilled in the art at the time the subject application was 
filed and/or described throughout the Specification of the instant application. See, e.g., page 12, 
line 13 through page 13, line 9; page 25, lines 2-6 and 18-28; and Example VI at pages 45-46. 
Thus, one skilled in the art need not make and test vast numbers of polynucleotides. Instead, one 
skilled in the art need only screen a cDNA library or use appropriate PCR conditions to identify 
relevant polynucleotides that already exist in nature. The skilled artisan would also know how to 
use the claimed polynucleotides, for example in expression profiling, disease diagnosis, or 
detection of related sequences as discussed above. 
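
By way of illustration only, the percent-identity comparison referred to above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of the Specification: the function names `percent_identity` and `meets_threshold` are hypothetical, the inputs are assumed to be already aligned (gaps written as "-"), identity is assumed to mean matching columns divided by aligned columns, and the short strings in the usage note are toy sequences, not SEQ ID NO:1 or SEQ ID NO:2.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: identity = matching columns / aligned columns * 100.
    Gaps are written as '-'; columns where both sequences are gapped are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep every column except those where both sequences have a gap.
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no aligned columns to compare")
    # A column counts as a match only if both residues are equal and not a gap.
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def meets_threshold(aligned_a: str, aligned_b: str, threshold: float) -> bool:
    """True if the two aligned sequences are at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For example, `meets_threshold("ATGCATGCAT", "ATGCATGGAT", 70.0)` compares two toy ten-base sequences that differ at one position and returns `True`. Other conventions for the denominator (for instance, the length of the shorter sequence) would give different percentages, so the choice made here should be read as one illustrative assumption.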

The specification also describes the expression vectors into which the claimed fragments 
could be inserted, and the construction of fusion proteins (pages 22-24 and page 47, line 8 
through page 48, line 3). The specification describes, for example, specific assays for myosin 
activity on page 48; binding assays to detect molecular interactions of "MHCH or biologically 
active fragments thereof" on page 50, lines 4-19; and immunological methods for detecting and 
measuring MHCH on page 25, lines 7-16. These methods could be used to detect and 
characterize peptide variants and fragments of SEQ ID NO:1. Given this guidance, one of 
ordinary skill in the art would readily understand how to select and screen polynucleotides 
encoding fragments of SEQ ID NO:1 with ATPase activity or immunogenic activity without any 
undue experimentation. 

Furthermore, the claims are directed to polynucleotides, not polypeptides, and it is the 
functionality of the claimed polynucleotides, not the polypeptides encoded by them, that is 
relevant. Members of the claimed genus of variants may include, for example, mutant alleles 
associated with diseases, or single nucleotide polymorphisms (SNPs). Members of the claimed 
genus of variants may be useful even if they encode defective MHCH polypeptides. For 
example, the variant polynucleotides could be used for the detection of sequences related to 
MHCH (see the specification at page 25, lines 17-28, and page 36, lines 24-30) including MHCH 
variants that may be associated with disease states, such as the diseases listed on page 27, line 16 
through page 28, line 3, of the specification. See the specification at, for example, pages 36-40 
for disclosure of how to use the claimed sequences in diagnostic assays. 

The Examiner has cited Attwood et al., identifying some of the difficulties that may be 
involved in predicting protein function; however, this reference does not suggest that functional 
homology cannot be inferred by a reasonable probability in this case. At most, this article 
suggests that it is difficult to make predictions about function with certainty. The standard 
applicable in this case is not proof to certainty, but rather, proof to a reasonable probability. In 
fact, Attwood et al. point out the value of sequence analysis, in particular with regard to the 
identification of conserved motifs in proteins. "Because motifs usually reflect some vital 
structural or functional role (Fig. 2), they effectively provide diagnostic family signatures" 
(Emphasis added; Attwood et al. p. 332, col. 2). 

The Examiner alleges that "the specification does not teach the specific 
structural/catalytic amino acids and the structural motifs essential for protein activity/function 
which cannot be altered. Such experimentation entails selecting specific nucleotides to change 
(deletion, insertion, substitution, or combinations thereof) in a polynucleotide to make the 
claimed polynucleotide and determining by assays whether the polypeptide has activity" (Final 
Office Action, page 4). This is untrue. Again, Appellants respectfully point out that the claims 
of the instant application are drawn to naturally-occurring variants. Through the process of 
natural selection, nature will have determined the appropriate sequences. 

Appellants also point out that the specification does teach "specific structural motifs 
essential for protein activity/function." The specification describes similarities between SEQ ID 
NO:1 and C. elegans myosin (g1279777) and H. annuus unconventional myosin (g2444174), 
including the presence of the myosin head domain, myosin heavy chain, and light chain binding 
site signatures (see specification, for example, at page 17, line 26 through page 18, line 9 and 
Figure 2). The myosin head domain is known to possess ATPase activity and contain actin 
binding sites. At the time of filing of the instant application, the crystallographic structure of a 
myosin motor head was available to assist one of skill in the art in the determination of "specific 
catalytic residues and structural motifs," particularly those critical for ATPase activity and actin 
binding (See reference of Rayment et al. (1993) Science 261:50-58, previously submitted with 
the response to the Office Action of April 4, 2003). 

Further, the Examiner requires working examples (Office Action, page 4). There is no 
such requirement under the law to provide "working examples." As set forth in In re Borkowski, 
164 USPQ 642, 645 (CCPA 1970) (footnote omitted): 

However, as we have stated in a number of opinions, a specification need not contain a 
working example if the invention is otherwise disclosed in such a manner that one skilled 
in the art will be able to practice it without an undue amount of experimentation. 

See also M.P.E.P. 2164.02 as follows: 

Compliance with the enablement requirement of 35 U.S.C. 112, first paragraph, does not 
turn on whether an example is disclosed. An example may be "working" or "prophetic"... 
A prophetic example describes an embodiment of the invention based on predicted results 
rather than work actually conducted or results actually achieved. 

Thus, there is no requirement under the law to provide "working examples" of what is 
claimed. Rather, one looks to whether the specification provides a description of how to make 
what is claimed. The present specification provides the requisite description. 

Contrary to the standard set forth in Marzocchi and Borkowski, the Examiner has failed to 
provide any reasons why one would doubt that the guidance provided by the present 
specification would enable one to make and use the recited polynucleotides. Hence, a prima 
facie case for non-enablement has not been established. For at least the above reasons, 
withdrawal of the enablement rejections under 35 U.S.C. § 112, first paragraph, is respectfully 
requested. 

Issue 4- Whether claims 23, 26-28, 30, and 31 meet the written description requirement of 
35 U.S.C. §112, first paragraph 

The rejection of claims 23, 26-28, and 31 is improper because the Specification 
provides an adequate written description of the claimed variants and fragments of SEQ ID 
NO:2. 

The requirements necessary to fulfill the written description requirement of 35 U.S.C. 
§ 112, first paragraph, are well established by case law. 

. . . the applicant must also convey with reasonable clarity to those skilled 
in the art that, as of the filing date sought, he or she was in possession of the 
invention. The invention is, for purposes of the "written description" inquiry, 
whatever is now claimed. Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 
(Fed. Cir. 1991) 

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for 
Examination of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1", published January 5, 
2001, which provide that: 

An applicant may also show that an invention is complete by disclosure of 
sufficiently detailed, relevant identifying characteristics which provide evidence 
that applicant was in possession of the claimed invention, i.e., complete or partial 
structure, other physical and/or chemical properties, functional characteristics 
when coupled with a known or disclosed correlation between function and 
structure, or some combination of such characteristics. What is conventional or 
well known to one of ordinary skill in the art need not be disclosed in detail. If a 
skilled artisan would have understood the inventor to be in possession of the 
claimed invention at the time of filing, even if every nuance of the claims is not 
explicitly described in the specification, then the adequate description requirement 
is met. 

Thus, the written description standard is fulfilled by both what is specifically disclosed 
and what is conventional or well known to one skilled in the art. 

SEQ ID NO:1 and SEQ ID NO:2 are specifically disclosed in the application (see, for
example, page 17, lines 19-34). Variants of SEQ ID NO:1 and SEQ ID NO:2 are described, for
example, at page 18, lines 18-33. Incyte clones in which the nucleic acids encoding the human
myosin heavy chain homolog were first identified and libraries from which those clones were
isolated are described, for example, at page 17, lines 19-25 of the Specification. Chemical and
structural features of SEQ ID NO:1 are described, for example, on page 17, line 26 through page
18, line 9. Given SEQ ID NO:1, one of ordinary skill in the art would recognize naturally-occurring
variants of SEQ ID NO:1 having 90% sequence identity to SEQ ID NO:1. Given SEQ
ID NO:2, one of ordinary skill in the art would recognize naturally-occurring variants of SEQ ID
NO:2 having 70% sequence identity to SEQ ID NO:2. Accordingly, the Specification provides
an adequate written description of the recited polypeptide sequences.
The Office Action has further asserted that the claims are not supported by an adequate
written description because the specification "only provides the following representative species
encompassed by these claims: a polynucleotide consisting of the nucleotide sequence of SEQ ID
NO: 2 and a polynucleotide encoding a polypeptide consisting of the amino acid sequence of SEQ
ID NO:1 ... Given this lack of additional representative species as encompassed by the claims,
Applicants have failed to sufficiently describe the claimed invention" ...

Such a position is believed to present a misapplication of the law. 

1. The present claims specifically define the claimed genus through the 
recitation of chemical structure 

Court cases in which "DNA claims" have been at issue commonly emphasize that the
recitation of structural features or chemical or physical properties is an important factor to
consider in a written description analysis of such claims. For example, in Fiers v. Revel, 25
USPQ2d 1601, 1606 (Fed. Cir. 1993), the court stated that:

If a conception of a DNA requires a precise definition, such as by structure,
formula, chemical name or physical properties, as we have held, then a description
also requires that degree of specificity.

In a number of instances in which claims to DNA have been found invalid, the courts
have noted that the claims attempted to define the claimed DNA in terms of functional
characteristics without any reference to structural features. As set forth by the court in University
of California v. Eli Lilly and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997):

In claims to genetic material, however, a generic statement such as "vertebrate
insulin cDNA" or "mammalian insulin cDNA," without more, is not an adequate
written description of the genus because it does not distinguish the claimed genus 
from others, except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of 
structural features, has been a common basis by which courts have found invalid claims to DNA. 
For example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written 
description requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within its 
nucleotide sequence a subsequence having the structure of the reverse transcript of 
an mRNA of a vertebrate, which mRNA encodes insulin. 
In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following
count:

A DNA which consists essentially of a DNA which codes for a human fibroblast 
interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an 
adequate written description of the DNA of the count because that application mentioned a 
potential method for isolating the DNA. The Revel priority application, however, did not have a 
description of any particular DNA structure corresponding to the DNA of the count. The court 
therefore found that the Revel priority application lacked an adequate written description of the 
subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional
characteristics and were found not to comply with the written description requirement of 35
U.S.C. §112, i.e., "an mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA
which codes for a human fibroblast interferon-beta polypeptide" in Fiers. In contrast to the
situation in Lilly and Fiers, the claims at issue in the present application define polynucleotides
in terms of chemical structure, rather than functional characteristics. For example, the
"variant language" of independent claim 30 recites chemical structure to define the claimed
genus:

An isolated polynucleotide selected from the group consisting of: ...

b) a polynucleotide comprising a naturally occurring polynucleotide sequence at
least 70% identical to a polynucleotide sequence of SEQ ID NO:2 ...

From the above it should be apparent that the claims of the subject application are 
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the 
present claims is defined in terms of the chemical structure of SEQ ID NO:2. In the present case, 
there is no reliance merely on a description of functional characteristics of the polynucleotides 
recited by the claims. In fact, there is no recitation of functional characteristics. Moreover, if
such functional recitations were included, they would add to the structural characterization of the
recited polynucleotides. The claims of the present application
recite structural features, and cases such as Lilly and Fiers stress that the recitation of structure is
an important factor to consider in a written description analysis of claims of this type. By failing
to base its written description inquiry "on whatever is now claimed," the Office Action failed to
provide an appropriate analysis of the present claims and how they differ from those found not to
satisfy the written description requirement in Lilly and Fiers.

2. The present claims do not define a genus which is "highly variant" 

Furthermore, the claims at issue do not describe a genus which could be characterized as 
"highly variant." Available evidence illustrates that the claimed genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference 
by Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified 
distant evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078). Through 
exhaustive analysis of a data set of proteins with known structural and functional relationships 
and with <90% overall sequence identity, Brenner et al. have determined that 30% identity is a 
reliable threshold for establishing evolutionary homology between two sequences aligned over at 
least 150 residues. (Brenner et al., pages 6073 and 6076.) Furthermore, local identity is 
particularly important in this case for assessing the significance of the alignments, as Brenner et 
al. further report that ≥40% identity over at least 70 residues is reliable in signifying homology
between proteins. (Brenner et al., page 6076.) 
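
By way of illustration only, the two thresholds attributed to Brenner et al. above can be expressed as a short Python sketch. The function names and the convention of counting identity only over ungapped alignment columns are assumptions made for this example; they are not taken from Brenner et al. or from the Specification.

```python
from typing import Tuple


def percent_identity(aligned_a: str, aligned_b: str) -> Tuple[float, int]:
    """Return (percent identity, aligned length) for two gapped strings.

    Assumption: both strings come from the same pairwise alignment, use '-'
    for gaps, and identity is counted only over columns without a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0, 0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs), len(pairs)


def meets_brenner_threshold(identity: float, aligned_len: int) -> bool:
    """Apply the thresholds as quoted in this brief:
    30% identity over at least 150 residues, or >=40% over at least 70 residues.
    """
    return (identity >= 30.0 and aligned_len >= 150) or (
        identity >= 40.0 and aligned_len >= 70
    )
```

Under these assumptions, an alignment of 80 residues at 45% identity would satisfy the second threshold, while an alignment of 100 residues at 35% identity would satisfy neither.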

The present application is directed, inter alia, to myosin proteins related to the amino
acid sequence of SEQ ID NO:1. In accordance with Brenner et al., naturally occurring molecules
may exist which could be characterized as myosin proteins and which have as little as 40%
identity over at least 70 residues to SEQ ID NO:1. The "variant language" of the present claims
recites, for example, a polynucleotide encoding "a polypeptide comprising a naturally occurring
amino acid sequence at least 90% identical to an amino acid sequence of SEQ ID NO:1" and "a
polynucleotide comprising a naturally occurring polynucleotide sequence at least 70% identical
to a polynucleotide sequence of SEQ ID NO:2" (note that SEQ ID NO:1 has 612 amino acid
residues). This variation is far less than that of all potential myosin proteins related to SEQ ID
NO:1, i.e., those myosin proteins having as little as 40% identity over at least 70 residues to SEQ
ID NO:1.
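
The difference in scope can be made concrete with a small worked calculation, offered here only as a sketch. It assumes that percent identity is measured over the full 612 residues of the polypeptide sequence, which is a simplifying assumption for this example rather than a definition drawn from the claims; the helper name is likewise hypothetical.

```python
def min_identical_positions(percent: int, length: int) -> int:
    """Smallest whole number of identical positions that reaches `percent` identity."""
    # Integer ceiling of percent * length / 100, avoiding floating-point rounding.
    return -(-percent * length // 100)


SEQ1_LENGTH = 612  # residues in the polypeptide sequence, as stated in the brief

# Claimed variants: at least 90% identical over the full length (assumption).
claimed_min_identical = min_identical_positions(90, SEQ1_LENGTH)
claimed_max_differences = SEQ1_LENGTH - claimed_min_identical

# Brenner et al. homology threshold: 40% identity over a 70-residue alignment.
brenner_min_identical = min_identical_positions(40, 70)

print(claimed_min_identical, claimed_max_differences, brenner_min_identical)
```

On these assumptions a claimed variant must match at 551 or more of the 612 positions (no more than 61 differences), whereas the 40% threshold requires only 28 matching positions within a 70-residue window.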
3. The state of the art at the time of the present invention is further advanced
than at the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to 
comply with the written description requirement of 35 U.S.C. §112. The '525 patent claimed the 
benefit of priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and 
Application Serial No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the 
benefit of priority of an Israeli application filed on November 21, 1979. Thus, the written
description inquiry in those cases was based on the state of the art at essentially the "dark ages"
of recombinant DNA technology.

The present application has a priority date of November 5, 1998. Much has happened in 
the development of recombinant DNA technology in the 19 or more years between the filing
of the applications involved in Lilly and Fiers and that of the present application. For example, the
technique of polymerase chain reaction (PCR) was invented. Highly efficient cloning and DNA 
sequencing technology has been developed. Large databases of protein and nucleotide sequences 
have been compiled. Much of the raw material of the human and other genomes has been 
sequenced. With these remarkable advances one of skill in the art would recognize that, given 
the sequence information of SEQ ID NO:1 and SEQ ID NO:2, and the additional extensive detail
provided by the subject application, the present inventors were in possession of the claimed 
polynucleotide variants at the time of filing of this application. 

CONCLUSIONS 

Appellants respectfully submit that rejections for lack of utility based, inter alia, on an 
allegation of "lack of specificity," as set forth in the Office Action and as justified in the Revised 
Interim and final Utility Guidelines and Training Materials, are not supported in the law. Neither 
are they scientifically correct, nor supported by any evidence or sound scientific reasoning. 
These rejections are alleged to be founded on facts in court cases such as Brenner and Kirk, yet 
those facts are clearly distinguishable from the facts of the instant application, and indeed most if 
not all nucleotide and protein sequence applications. Nevertheless, the PTO is attempting to 
mold the facts and holdings of these prior cases, "like a nose of wax," 2 to target rejections of
claims to polypeptide and polynucleotide sequences, as well as to claims to methods of detecting 
said polynucleotide sequences, where biological activity information has not been proven by 
laboratory experimentation, and they have done so by ignoring perfectly acceptable utilities fully 
disclosed in the specifications as well as well-established utilities known to those of skill in the 
art. As is disclosed in the specification, and even more clearly, as one of ordinary skill in the art 
would understand, the claimed invention has well-established, specific, substantial and credible 
utilities. The rejections are, therefore, improper and should be reversed. 

Moreover, to the extent the above rejections were based on the Revised Interim and final 
Examination Guidelines and Training Materials, those portions of the Guidelines and Training 
Materials that form the basis for the rejections should be determined to be inconsistent with the 
law. 

The written description rejections under 35 U.S.C. § 112, first paragraph, should be 
reversed based on at least the arguments presented above. The Examiner failed to base the 
written description inquiry "on whatever is now claimed." Consequently, the Examiner did not 
provide an appropriate analysis of the present claims and how they differ from those found not to 
satisfy the written description requirement in cases such as Lilly and Fiers. In particular, the 
claims of the subject application are fundamentally different from those found invalid in Lilly 
and Fiers. The subject matter of the present claims is defined in terms of the chemical structure 
of SEQ ID NO:1 and SEQ ID NO:2. The courts have stressed that structural features are
important factors to consider in a written description analysis of claims to nucleic acids and 
proteins. In addition, the genus of polypeptides defined by the present claims is adequately 
described, as evidenced by Brenner et al. Furthermore, there have been remarkable advances in 
the state of the art since the Lilly and Fiers cases, and these advances were given no 
consideration whatsoever in the position set forth by the Examiner. 

Due to the urgency of this matter, including its economic and public health implications, 
an expedited review of this appeal is earnestly solicited. 



2 "The concept of patentable subject matter under §101 is not 'like a nose of wax which 
may be turned and twisted in any direction * * *.' White v. Dunbar, 119 U.S. 47, 51." (Parker v.
Flook, 198 USPQ 193 (US SupCt 1978))
If the USPTO determines that any additional fees are due, the Commissioner is hereby
authorized to charge Deposit Account No. 09-0108.
This brief is enclosed in triplicate.

Respectfully submitted, 
APPENDIX - CLAIMS ON APPEAL



23. An isolated polynucleotide encoding a polypeptide selected from the group 
consisting of: 

a) a polypeptide comprising an amino acid sequence of SEQ ID NO:1,

b) a polypeptide comprising a naturally occurring amino acid sequence at least 90%
identical to an amino acid sequence of SEQ ID NO:1, said polypeptide having
ATPase activity,

c) a biologically active fragment of a polypeptide having an amino acid sequence of
SEQ ID NO:1, said fragment having ATPase activity, and

d) an immunogenic fragment of a polypeptide consisting of an amino acid sequence
of SEQ ID NO:1, wherein said fragment comprises at least 15 contiguous amino
acid residues of SEQ ID NO:1.

24. An isolated polynucleotide of claim 23 comprising a polynucleotide encoding a
polypeptide comprising an amino acid sequence of SEQ ID NO:1.

25. An isolated polynucleotide of claim 24 comprising a polynucleotide sequence of 
SEQ ID NO:2. 

26. A recombinant polynucleotide comprising a promoter sequence operably linked to a 
polynucleotide of claim 23. 

27. A cell transformed with a recombinant polynucleotide of claim 26. 

28. A method of producing a polypeptide, the method comprising: 

a) culturing a cell under conditions suitable for expression of the polypeptide, 
wherein said cell is transformed with a recombinant polynucleotide, and said 
recombinant polynucleotide comprises a promoter sequence operably linked to a 
polynucleotide of claim 23, and 
b) recovering the polypeptide so expressed.

29. A method of claim 28, wherein the polypeptide comprises an amino acid sequence
of SEQ ID NO:1.

30. An isolated polynucleotide selected from the group consisting of: 

a) a polynucleotide comprising a polynucleotide sequence of SEQ ID NO:2, 

b) a polynucleotide comprising a naturally occurring polynucleotide sequence at least 
70% identical to a polynucleotide sequence of SEQ ID NO:2, 

c) a polynucleotide complementary to a polynucleotide of a), 

d) a polynucleotide complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

31. An isolated polynucleotide consisting of at least 25 contiguous nucleotides of a 
polynucleotide selected from the group consisting of: 

a) a polynucleotide consisting of a polynucleotide sequence of SEQ ID NO:2, and 

b) a polynucleotide complementary to a polynucleotide of a). 
1. An important feature of the work of many molecular biologists is identifying which
genes are switched on and off in a cell under different environmental conditions or 
subsequent to xenobiotic challenge. Such information has many uses, including the 
deciphering of molecular pathways and facilitating the development of new experimental 
and diagnostic procedures. However, the student of gene hunting should be forgiven for 
perhaps becoming confused by the mountain of information available as there appear to be
almost as many methods of discovering differentially expressed genes as there are research 
groups using the technique. 

2. The aim of this review was to clarify the main methods of differential gene expression
analysis and the mechanistic principles underlying them. Also included is a discussion on
some of the practical aspects of using this technique. Emphasis is placed on the so-called
'open' systems, which require no prior knowledge of the genes contained within the study
model. Whilst these will eventually be replaced by 'closed' systems in the study of human,
mouse and other commonly studied laboratory animals, they will remain a powerful tool for 
those examining less fashionable models. 

3. The use of suppression-PCR subtractive hybridization is exemplified in the 
identification of up- and down-regulated genes in rat liver following exposure to
phenobarbital, a well-known inducer of the drug metabolizing enzymes.

4. Differential gene display provides a coherent platform for building libraries and 
microchip arrays of 'gene fingerprints' characteristic of known enzyme inducers and 
xenobiotic toxicants, which may be interrogated subsequently for the identification and 
characterization of xenobiotics of unknown biological properties. 



Introduction 

It is now apparent that the development of almost all cancers and many non-neoplastic
diseases is accompanied by altered gene expression in the affected cells
compared to their normal state (Hunter 1991, Wynford-Thomas 1991, Vogelstein
and Kinzler 1993, Semenza 1994, Cassidy 1995, Kleinjan and Van Heyningen 1998).
Such changes also occur in response to external stimuli such as pathogenic micro-organisms
(Rohn et al. 1996, Singh et al. 1997, Griffin and Krishna 1998, Lunney
1998) and xenobiotics (Sewall et al. 1995, Dogra et al. 1998, Ramana and Kohli
1998), as well as during the development of undifferentiated cells (Hecht 1998,
Rudin and Thompson 1998, Schneider-Maunoury et al. 1998). The potential
medical and therapeutic benefits of understanding the molecular changes which
occur in any given cell in progressing from the normal to the 'altered' state are
enormous. Such profiling essentially provides a 'fingerprint' of each step of a
cell's development or response and should help in the elucidation of specific and
sensitive biomarkers representing, for example, different types of cancer or previous
exposure to certain classes of chemicals that are enzyme inducers.

In drug metabolism, many of the xenobiotic-metabolizing enzymes (including 
the well-characterized isoforms of cytochrome P450) are inducible by drugs and 
chemicals in man (Pelkonen et al. 1998), predominantly involving transcriptional 
activation of not only the cognate cytochrome P450 genes, but additional cellular 
proteins which may be crucial to the phenomenon of induction. Accordingly, the 
development of methodology to identify and assess the full complement of genes 
that are either up- or down-regulated by inducers is crucial in the development of
knowledge to understand the precise molecular mechanisms of enzyme induction 
and how this relates to drug action. Similarly, in the field of chemical-induced 
toxicity, it is now becoming increasingly obvious that most adverse reactions to 
drugs and chemicals are the result of multiple gene regulation, some of which are 
causal and some of which are casually-related to the toxicological phenomenon per 
se. This observation has led to an upsurge in interest in gene-profiling technologies 
which differentiate between the control and toxin-treated gene pools in target tissues 
and are, therefore, of value in rationalizing the molecular mechanisms of xenobiotic-induced
toxicity. Knowledge of toxin-dependent gene regulation in target tissues is
not solely an academic pursuit as much interest has been generated in the 
pharmaceutical industry to harness this technology in the early identification of toxic 
drug candidates, thereby shortening the developmental process and contributing 
substantially to the safety assessment of new drugs. For example, if the gene profile 
in response to say a testicular toxin that has been well-characterized in vivo could be 
determined in the testis, then this profile would be representative of all new drug 
candidates which act via this specific molecular mechanism of toxicity, thereby 
providing a useful and coherent approach to the early detection of such toxicants. 
Whereas it would be informative to know the identity and functionality of all genes 
up/down regulated by such toxicants, this would appear a longer term goal, as the
majority of human genes have not yet been sequenced, far less their functionality 
determined. However, the current use of gene profiling yields a pattern of gene 
changes for a xenobiotic of unknown toxicity which may be matched to that of well- 
characterized toxins, thus alerting the toxicologist to possible in vivo similarities
between the unknown and the standard, thereby providing a platform for more 
extensive toxicological examination. Such approaches are beginning to gain 
momentum, in that several biotechnology companies are commercially producing
'gene chips' or 'gene arrays' that may be interrogated for toxicity assessment of
xenobiotics. These chips consist of hundreds/thousands of genes, some of which are
degenerate in the sense that not all of the genes are mechanistically-related to any 
one toxicological phenomenon. Whereas these chips are useful in broad-spectrum 
screening, they are maturing at a substantial rate, in that gene arrays are now 
becoming more specific, e.g. chips for the identification of changes in growth factor 
families that contribute to the aetiology and development of chemically-induced 
neoplasias. 
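
The idea of matching the expression pattern of an uncharacterized xenobiotic against profiles of well-characterized toxins can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each profile is reduced to a direction of change per gene (+1 up, -1 down), the similarity score is a simple concordance fraction chosen for this example, and all gene and toxin names are hypothetical placeholders rather than data from any study.

```python
from typing import Dict

# gene name -> +1 (up-regulated) or -1 (down-regulated); hypothetical encoding
Profile = Dict[str, int]


def profile_similarity(a: Profile, b: Profile) -> float:
    """Fraction of all genes seen in either profile that change in the same direction."""
    all_genes = set(a) | set(b)
    if not all_genes:
        return 0.0
    concordant = sum(1 for gene in set(a) & set(b) if a[gene] == b[gene])
    return concordant / len(all_genes)


def best_match(unknown: Profile, references: Dict[str, Profile]) -> str:
    """Name of the reference profile most similar to the unknown profile."""
    return max(references, key=lambda name: profile_similarity(unknown, references[name]))


# Hypothetical reference 'fingerprints' and an unknown compound.
references = {
    "toxin_X": {"gene1": 1, "gene2": -1, "gene3": 1},
    "toxin_Y": {"gene1": -1, "gene4": 1},
}
unknown = {"gene1": 1, "gene2": -1, "gene5": 1}

print(best_match(unknown, references))
```

A real comparison would use quantitative expression changes and a statistically grounded similarity measure; the concordance fraction here only conveys the matching principle.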

Although documenting and explaining these genetic changes presents a 
formidable obstacle to understanding the different mechanisms of development and 
disease progression, the technology is now available to begin attempting this difficult 
challenge. Indeed, several 'differential expression analysis' methods have been 
developed which facilitate the identification of gene products that demonstrate 
altered expression in cells of one population compared to another. These methods
have been used to identify differential gene expression in many situations, including 
invading pathogenic microbes (Zhao et al. 1998), in cells responding to extracellular 
and intracellular microbial invasion (Duguid and Dinauer 1990, Ragno et al. 1997, 
Maldarelli et al. 1998), in chemically treated cells (Syed et al. 1997, Rockett et al. 
1999), neoplastic cells (Liang et al. 1992, Chang and Terzaghi-Howe 1998), 
activated cells (Gurskaya et al. 1996, Wan et al. 1996), differentiated cells (Hara et 
al. 1991, Guimaraes et al. 1995a, b), and different cell types (Davis et al. 1984, 
Hedrick et al. 1984, Xhu et al. 1998). Although differential expression analysis 
technologies are applicable to a broad range of models, perhaps their most important 
advantage is that, in most cases, absolutely no prior knowledge of the specific genes 
which are up- or down-regulated is required. 

The field of differential expression analysis is a large and complex one, with 
many techniques available to the potential user. These can be categorized into 
several methodological approaches, including: 

(1) Differential screening, 

(2) Subtractive hybridization (SH) (includes methods such as chemical cross-linking
subtraction (CCLS), suppression-PCR subtractive hybridization (SSH),
and representational difference analysis (RDA)),

(3) Differential display (DD),

(4) Restriction endonuclease facilitated analysis (including serial analysis of gene
expression (SAGE) and gene expression fingerprinting (GEF)),

(5) Gene expression arrays, and 

(6) Expressed sequence tag (EST) analysis. 

The above approaches have been used successfully to isolate differentially 
expressed genes in different model systems. However, each method has its own 
subtle (and sometimes not so subtle) characteristics which incur various advantages 
and disadvantages. Accordingly, it is the purpose of this review to clarify the 
mechanistic principles underlying the main differential expression methods and to 
highlight some of the broader considerations and implications of this very powerful 
and increasingly popular technique. Specifically, we will concentrate on the so- 
called 'open' systems, namely those which do not require any knowledge of gene 
sequences and, therefore, are useful for isolating unknown genes. Two 'closed'
systems (those utilising previously identified gene sequences), EST analysis and the
use of DNA arrays, will also be considered briefly for completeness. Whilst
emphasis will often be placed on suppression PCR subtractive hybridization (SSH, 
the approach employed in this laboratory), it is the aim of the authors to highlight, 
wherever possible, those areas of common interest to those who use, or intend to use, 
differential gene expression analysis. 

Differential cDNA library screening (DS) 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis, 
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. One of the original 
approaches used to identify such genes was described 20 years ago by St John and 
Davis (1979). These authors developed a method, termed 'differential plaque filter 
hybridization', which was used to isolate galactose-inducible DNA sequences from
yeast. The theory is simple: a genomic DNA library is prepared from normal,
unstimulated cells of the test organism/tissue and multiple filter replicas are
prepared. These replica blots are probed with radioactively (or otherwise) labelled
complex cDNA probes prepared from the control and test cell mRNA populations.
Those mRNAs which are differentially expressed in the treated cell population will
show a positive signal only on the filter probed with cDNA from the treated cells.
Furthermore, labelled cDNA from different test conditions can be used to probe
multiple blots, thereby enabling the identification of mRNAs which are only up-regulated
under certain conditions. For example, St John and Davis (1979) screened
replica filters with acetate-, glucose- and galactose-derived probes in order to obtain
genes induced specifically by galactose metabolism. Although groundbreaking in its
time, this method is now considered insensitive and time-consuming, as up to 2
months are required to complete the identification of genes which are differentially 
expressed in the test population. In addition, there is no convenient way to check 
that the procedure has worked until the whole process has been completed. 
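
The selection logic of replica-filter screening can be sketched as a set operation: a clone is of interest if it gives a signal on the filter probed with the condition of interest and on no other filter. The following Python sketch assumes each filter has already been scored into a set of positive clone identifiers; the clone names and the function name are hypothetical.

```python
from typing import Dict, Set


def condition_specific_clones(signals: Dict[str, Set[str]], target: str) -> Set[str]:
    """Clones positive on the `target` filter and negative on every other filter."""
    positive_elsewhere: Set[str] = set()
    for condition, clones in signals.items():
        if condition != target:
            positive_elsewhere |= clones
    return signals[target] - positive_elsewhere


# Hypothetical scoring of three replica filters, one per probe.
signals = {
    "acetate": {"clone1", "clone2"},
    "glucose": {"clone1", "clone3"},
    "galactose": {"clone1", "clone4", "clone5"},
}

print(sorted(condition_specific_clones(signals, "galactose")))
```

In this toy data set, clone1 hybridizes with all three probes and is discarded, leaving clone4 and clone5 as candidates induced specifically under the galactose condition.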

Subtractive Hybridization (SH) 

The developing concept of differential gene expression and the success of early 
approaches such as that described by St John and Davis (1979) soon gave rise to a 
search for more convenient methods of analysis. One of the first to be developed was 
SH, numerous variations of which have since been reported (see below). In general, 
this approach involves hybridization of mRNA/cDNA from one population (tester)
to excess mRNA/cDNA from another (driver), followed by separation of the 
unhybridized tester fraction (differentially expressed) from the hybridized common 
sequences. This step has been achieved physically, chemically and through the use 
of selective polymerase chain reaction (PCR) techniques. 
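
The role of the driver excess can be illustrated with a deliberately simplified model. The Python sketch below assumes that hybridization goes to completion and that each driver copy removes exactly one matching tester copy; real hybridization kinetics are far more complex. The sequence labels, copy numbers and the `driver_excess` parameter are hypothetical values chosen for this example.

```python
from collections import Counter


def subtract(tester: Counter, driver: Counter, driver_excess: int = 10) -> Counter:
    """Return the tester copies left unhybridized after one round of subtraction.

    Simplifying assumption: every driver copy (scaled by `driver_excess`)
    captures one tester copy of the same sequence.
    """
    remaining: Counter = Counter()
    for sequence, copies in tester.items():
        hybridized = min(copies, driver.get(sequence, 0) * driver_excess)
        if copies > hybridized:
            remaining[sequence] = copies - hybridized
    return remaining


# Hypothetical transcript copy numbers in the two populations.
tester = Counter({"common": 50, "induced": 40, "unique": 5})
driver = Counter({"common": 50, "induced": 2})

print(dict(subtract(tester, driver, driver_excess=10)))
```

In this model the common sequence is removed entirely, the tester-only sequence survives intact, and the induced sequence is only partly removed, which is why the amount of driver excess affects what remains in the subtracted fraction.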

Physical separation 

Original subtractive hybridization technology involved the physical separation 
of hybridized common species from unique single stranded species. Several methods
of achieving this have been described, including hydroxyapatite chromatography 
(Sargent and Dawid 1983), avidin-biotin technology (Duguid and Dinauer 1990) 
and oligodT-latex separation (Hara et al. 1991). In the first approach, common 
mRNA species are removed by cDNA (from test cells)-mRNA (from control cells) 
subtractive hybridization followed by hydroxyapatite chromatography, as hydroxy- 
apatite specifically adsorbs the cDNA-mRNA hybrids. The unabsorbed cDNA is 
then used either for the construction of a cDNA library of differentially expressed 
genes (Sargent and Dawid 1983, Schneider et al. 1988) or directly as a probe to 
screen a preselected library (Zimmerman et al. 1980, Davis et al. 1984, Hedrick et al. 
1984). A schematic diagram of the procedure is shown in figure 1.

Less rigorous physical separation procedures coupled with sensitivity enhancing 
PCR steps were later developed as a means to overcome some of the problems 
encountered with the hydroxyapatite procedure. For example, Duguid and Dinauer
(1990) described a method of subtraction utilizing biotin-affinity systems as a means
to remove hybridized common sequences. In this process, both the control and
tester mRNA populations are first converted to cDNA and an adaptor ('oligovector
[Figure 1 schematic: unhybridized cDNA (differentially expressed) and mRNA; Sepharose CL6B exclusion chromatography to separate small cDNA fragments (<450 bp); enriched, differentially expressed cDNA; produce clones, or label directly and probe library.]

Figure 1. The hydroxyapatite method of subtractive hybridization. cDNA derived from the
treated/altered (tester) population is mixed with a large excess of mRNA from the control (driver)
population. Following hybridization, mRNA-cDNA hybrids are removed by hydroxyapatite 
chromatography. The only cDNAs which remain are those which are differentially expressed in 
the treated/altered population. In order to facilitate the recovery of full length clones, small cDNA 
fragments are removed by exclusion chromatography. The remaining cDNAs are then cloned into 
a vector for sequencing, or labelled and used directly to probe a library, as described by Sargent 
and Dawid (1983). 

containing a restriction site) ligated to both sides. Both populations are then
amplified by PCR, but the driver cDNA population is subsequently digested with
the restriction endonuclease that cuts within the adaptor. This serves to cleave the
oligovector and reduce the amplification potential of the control population. The digested
control population is then biotinylated and an excess mixed with tester cDNA.
Following denaturation and hybridization, the mix is applied to a biocytin column
(streptavidin may also be used) to remove the control population, including
heteroduplexes formed by annealing of common sequences from the tester
population. The procedure is repeated several times following the addition of fresh



Figure 2. The use of oligo(dT30)-latex to perform subtractive hybridization. mRNA extracted from the
control (driver) population is converted to anchored cDNA using polydT oligonucleotides
attached to latex beads. mRNA from the treated/altered (tester) population is repeatedly
hybridized against an excess of the anchored driver cDNA. The final population of mRNA is
tester specific and can be converted into cDNA for cloning and other downstream applications, as
described by Hara et al. (1991).
control cDNA. In order to further enrich those species differentially expressed in
the tester cDNA, the subtracted tester population is amplified by PCR following 
every second subtraction cycle. After six cycles of subtraction (three reamplification 
steps) the reaction mix is ligated into a vector for further analysis. 

In a slightly different approach, Hara et al. (1991) utilized a method whereby
oligo(dT30) primers attached to a latex substrate are used to first capture mRNA
extracted from the control population. Following 1st strand cDNA synthesis, the
RNA strand of the heteroduplexes is removed by heat denaturation and centrifugation
(the cDNA-oligotex-dT30 forms a pellet and the supernatant is removed).
A quantity of tester mRNA is then repeatedly hybridized to the immobilized control
(driver) cDNA (which is present in 20-fold excess). After several rounds of
hybridization the only mRNA molecules left in the tester mRNA population are
those which are not found in the driver cDNA-oligotex-dT30 population. These
tester-specific mRNA species are then converted to cDNA and, following the
addition of adaptor sequences, amplified by PCR. The PCR products are then
ligated into a vector for further analysis using restriction sites incorporated into the
PCR primers. A schematic illustration of this subtraction process is shown in figure
2.

However, all these methods utilising physical separation have been described as 
inefficient due to the requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of 
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems. 

Chemical Cross-Linking Subtraction (CCLS)

In this technique, originally described by Hampson et al. (1992), driver mRNA
is mixed with tester cDNA (1st strand only) in a ratio of >20:1. The common
sequences form cDNA:mRNA hybrids, leaving the tester specific species as single
stranded cDNA. Instead of physically separating these hybrids, they are inactivated
chemically using 2,5-diaziridinyl-1,4-benzoquinone (DZQ). Labelled probes are
then synthesized from the remaining single stranded cDNA species (unreacted
mRNA species remaining from the driver are not converted into probe material due
to specificity of Sequenase T7 DNA polymerase used to make the probe) and used
to screen a cDNA library made from the tester cell population. A schematic diagram
of the system is shown in figure 3.

It has been shown that the differentially expressed sequences can be enriched at
least 300-fold with one round of subtraction (Hampson et al. 1992), and that the
technique should allow isolation of cDNAs derived from transcripts that are present
at less than 50 copies per cell. This equates to genes at the low end of intermediate
abundance (see table 1). The main advantages of the CCLS approach are that it is
rapid, technically simple and also produces fewer false positives than other
differential expression analysis methods. However, like the physical separation
protocols, a major drawback with CCLS is the large amount of starting material
required (at least 10 µg RNA). Consequently, the technique has recently been
refined so that a renewable source of RNA can be generated. The degenerate random
oligonucleotide primed (DROP) adaptation (Hampson et al. 1996, Hampson and
Hampson 1997) uses random hexanucleotide sequences to prime solid phase-synthesized
cDNA. Since each primer includes a T7 polymerase promoter sequence

Figure 3. Chemical cross-linking subtraction. Excess driver mRNA is mixed with 1st strand tester
cDNA. The common sequences form mRNA:cDNA hybrids which are cross-linked with
2,5-diaziridinyl-1,4-benzoquinone (DZQ), and the remaining cDNA sequences are differentially
expressed in the tester population. Probes are made from these sequences using Sequenase 2.0
DNA polymerase, which lacks reverse transcriptase activity and, therefore, does not react with the
remaining mRNA molecules from the driver. The labelled probes are then used to screen a cDNA
library for clones of differentially expressed sequences. Adapted from Walter et al. (1996), with
permission.



Table 1. The abundance of mRNA species and classes in a typical mammalian cell.

mRNA class     Copies of each   No. of mRNA        Mean % of each     Mean mass (ng) of each
               species/cell     species in class   species in class   species/µg total RNA

Abundant       12000            4                  3.3                1.65
Intermediate   300              500                0.08               0.04
Rare           15               11000              0.004              0.002

Modified from Bertioli et al. (1995).
at the 5' end, the final pool of random cDNA fragments is a PCR-renewable cDNA
population which is representative of the expressed gene pool and can be used to 
synthesize sense RNA for use as driver material. Furthermore, if the final pool of 
random cDNA fragments is reamplified using biotinylated T7 primer and random 
hexamer, the product can be captured with streptavidin beads and the antisense 
strand eluted for use as tester. Since both target and driver can be generated from 
the same DROP product, subtraction can be performed in both directions (i.e. for 
up- and down-regulated species) between two different DROP products. 
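
The figures in table 1 can be cross-checked with a few lines of arithmetic. The following Python sketch recomputes the percentage and mass columns from the copy numbers and species counts. It rests on one assumption that the text does not state, namely that mRNA makes up about 5% of total RNA (50 ng per µg); under that assumption the computed values agree with the table to the precision shown. The function and variable names are illustrative only.

```python
# Values transcribed from table 1 (modified from Bertioli et al. 1995).
CLASSES = {
    "abundant": {"copies_per_cell": 12000, "n_species": 4},
    "intermediate": {"copies_per_cell": 300, "n_species": 500},
    "rare": {"copies_per_cell": 15, "n_species": 11000},
}

# Assumption (not stated in the text): mRNA is 5% of total RNA,
# i.e. 50 ng of mRNA per microgram of total RNA.
MRNA_NG_PER_UG_TOTAL = 50.0


def total_mrna_copies(classes):
    """Total number of mRNA molecules per cell, summed over all classes."""
    return sum(c["copies_per_cell"] * c["n_species"] for c in classes.values())


def percent_per_species(classes):
    """Share (%) of all mRNA copies contributed by one species of each class."""
    total = total_mrna_copies(classes)
    return {
        name: 100.0 * c["copies_per_cell"] / total
        for name, c in classes.items()
    }


def mass_ng_per_species(classes, mrna_ng=MRNA_NG_PER_UG_TOTAL):
    """Mean mass (ng) of one species per microgram of total RNA."""
    return {
        name: pct / 100.0 * mrna_ng
        for name, pct in percent_per_species(classes).items()
    }
```

On these numbers a transcript present at 50 copies per cell lies between the rare class (15 copies) and the intermediate class (300 copies), which is consistent with the description of such genes as being at the low end of intermediate abundance.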

Representational Difference Analysis (RDA)

RDA of cDNA (Hubank and Schatz 1994) is an extension of the technique 
originally applied to genomic DNA as a means of identifying differences between 
two complex genomes (Lisitsyn et al. 1993). It is a process of subtraction and 
amplification involving subtractive hybridization of the tester in the presence of 
excess driver. Sequences in the tester that have homologues in the driver are 
rendered unamplifiable, whereas those genes expressed only in the tester retain the 
ability to be amplified by PCR. The procedure is shown schematically in figure 4. 

In essence, the driver and tester mRNA populations are first converted to cDNA
and amplified by PCR following the ligation of an adaptor. The adaptors are then
removed from both populations and a new (different) adaptor ligated to the
amplified tester population only. Driver and tester populations are next melted and
hybridized together in a ratio of 100:1. Following hybridization, only tester:tester
homohybrids have 5' adaptors at each end of the DNA duplex and can, thus, be filled
in at both 3' ends. Hence, only these molecules are amplified exponentially during
the subsequent PCR step. Although tester:driver heterohybrids are present, they
only amplify in a linear fashion, since the strand derived from the driver has no
adaptor to which the primer can bind. Driver:driver homohybrids have no
adaptors and, therefore, are not amplified. Single stranded molecules are digested
with mung bean nuclease before a further PCR-enrichment of the tester:tester
homohybrids. The adaptors on the amplified tester population are then replaced and
the whole process repeated a further two or three times using an increasing excess of
driver (Hubank and Schatz used a tester:driver ratio of 1:400, 1:80000 and
1:800000 for the second, third and fourth hybridizations, respectively). Different
adaptors are ligated to the tester between successive rounds of hybridization and
amplification to prevent the accumulation of PCR products that might interfere with
subsequent amplifications. The final display is a series of differentially expressed
gene products easily observable on an ethidium bromide gel.
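
The selective power of the PCR step comes from the difference between exponential and linear amplification. The short Python sketch below illustrates this under idealized assumptions that are not taken from the original protocol: every cycle is perfectly efficient, a tester:tester homohybrid doubles each cycle, and a tester:driver heterohybrid gains one new copy per cycle. The function names and the choice of 20 cycles are purely illustrative.

```python
def homohybrid_copies(cycles, start=1.0):
    """Tester:tester duplexes carry adaptors at both ends, so the
    population doubles with every (idealized) PCR cycle."""
    return start * 2 ** cycles


def heterohybrid_copies(cycles, start=1.0):
    """Tester:driver duplexes carry an adaptor on one strand only, so
    each cycle adds a fixed number of copies (linear growth)."""
    return start * (1 + cycles)


def enrichment(cycles):
    """Ratio of homohybrid to heterohybrid product after a number of cycles."""
    return homohybrid_copies(cycles) / heterohybrid_copies(cycles)
```

With these assumptions, 20 cycles turn one homohybrid into 1048576 copies and one heterohybrid into 21, a difference of roughly 50000-fold. Real reactions are less efficient, so the figure should be read as an upper bound on the principle rather than a measured value.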

The main advantages of RDA are that it offers a reproducible and sensitive 
approach to the analysis of differentially expressed genes. Hubank and Schatz (1994) 
reported that they were able to isolate genes that were differentially expressed in 
substantially less than 1% of the cells from which the tester is derived. Perhaps the 
main drawback is that multiple rounds of ligation, hybridization, amplification and
digestion are required. The procedure is, therefore, lengthier than many other
differential display approaches and provides more opportunity for operator-induced
error to occur. Although the generation of false positives has been noted, this has
been solved to some degree by O'Neill and Sinclair (1997) through the use of HPLC-
purified adaptors. These are free of the truncated adaptors which appear to be a
major source of the false positive bands. A very similar technique to RDA, termed
linker capture subtraction (LCS), was described by Yang and Sytkowski (1996).
Figure 4. The representational difference analysis (RDA) technique. Driver and tester cDNA are
digested with a 4-cutter restriction enzyme such as DpnII. The 1st set of 12/24 adaptor strands
(oligonucleotides) are ligated to each other and the digested cDNA products. The 12mer is
subsequently melted away and the 3' ends filled in using Taq DNA polymerase. Each cDNA
population is then amplified using PCR, following which the 1st set of adaptors is removed with
DpnII. A second set of 12/24 adaptor strands is then added to the amplified tester cDNA
population, after which the tester is hybridized against a large excess of driver. The 12mer
adaptors are melted and the 3' ends filled in as before. PCR is carried out with primers identical
to the new 24mer adaptor. Thus, the only hybridization products which are exponentially
amplified are those which are tester:tester combinations. Following PCR, ssDNA products are
removed with mung bean nuclease, leaving the 'first difference product'. This is digested and a
third set of 12/24 adaptors added before repeating the subtraction process from the hybridization
stage. The process is repeated to the 3rd or 4th difference product, as described by Lisitsyn et al.
(1993) and Hubank and Schatz (1994).






Suppression PCR Subtractive Hybridization (SSH) 

The most recent adaptation of the SH approach to differential expression 
analysis was first described by Diatchenko et al. (1996) and Gurskaya et al. (1996). 
They reported that a 1000-5000 fold enrichment of rare cDNAs (equivalent to 
isolating mRNAs present at only a few copies per cell) can be obtained without the 
need for multiple hybridizations/subtractions. Instead of physical or chemical 
removal of the common sequences, a PCR-based suppression system is used (see 
figure 5). 

In SSH, excess driver cDNA is added to two portions of the tester cDNA which
have been ligated with different adaptors. A first round of hybridization serves to
enrich differentially expressed genes and equalize rare and abundant messages.
Equalization occurs since reannealing is more rapid for abundant molecules than for
rarer molecules due to the second order kinetics of hybridization (James and Higgins
1985). The two primary hybridization mixes are then mixed together in the presence
of excess driver and allowed to hybridize further. This step permits the annealing of
single stranded complementary sequences which did not hybridize in the primary
hybridization, and in doing so generates templates for PCR amplification. Although
there are several possible combinations of the single stranded molecules present in
the secondary hybridization mix, only one particular combination (differentially
expressed in the tester cDNA composed of complementary strands having different
adaptors) can amplify exponentially.
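
The equalization effect follows directly from second order kinetics and can be illustrated numerically. The Python sketch below uses the standard second order reassociation expression, C(t) = C0 / (1 + k C0 t), for the concentration of a species remaining single stranded. The rate constant, the starting concentrations and the time units are arbitrary illustrative values rather than measured parameters, and the model ignores the driver, so it should be read only as a demonstration of why abundant species are depleted faster than rare ones.

```python
def single_stranded_remaining(c0, k, t):
    """Concentration still single stranded after time t for a species
    with starting concentration c0, assuming ideal second order
    reassociation: C(t) = c0 / (1 + k * c0 * t)."""
    return c0 / (1.0 + k * c0 * t)


def abundance_ratio(c0_abundant, c0_rare, k, t):
    """Ratio of abundant to rare single stranded molecules at time t."""
    abundant = single_stranded_remaining(c0_abundant, k, t)
    rare = single_stranded_remaining(c0_rare, k, t)
    return abundant / rare
```

For example, two species that start at a 1000:1 ratio (k = 1, arbitrary units) are separated by a factor of only about 1.1 in the single stranded fraction at t = 10, which is the sense in which the primary hybridization equalizes rare and abundant messages.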

Having obtained the final differential display, two options are available if cloning 
of cDNAs is desired. One is to transform the whole of the final PCR reaction into 
competent cells. Transformed colonies can then be isolated and their inserts 
characterized by sequencing, restriction analysis or PCR. Alternatively, the final 
PCR products can be resolved on a gel and the individual bands excised, reamplified 
and cloned. The first approach is technically simpler and less time consuming. 
However, ligation/transformation reactions are known to be biased towards the 
cloning of smaller molecules, and so the final population of clones will probably not 
contain a representative selection of the larger products. In addition, although 
equalization theoretically occurs, observations in this laboratory suggest that this is 
by no means perfectly accomplished. Consequently, some gene species are present 
in a higher number than others and this will be represented in the final population 
of clones. Thus, in order to obtain a substantial proportion of those gene species that
actually demonstrate differential expression in the tester population, the number of
clones that will have to be screened after this step may be substantial. The second
approach is initially more time consuming and technically demanding. However, it
would appear to offer better prospects for cloning larger and low abundance gel
products. In addition, one can incorporate a screening step that distinguishes
products of the same size but different sequence (HA-staining, see
later). In this way, a good idea of the final number of clones to be isolated and
identified can be obtained.

An alternative (or even complementary) approach is to use the final differential 
display reaction to screen a cDNA library to isolate full length clones for further 
characterization, or a DNA array (see later) to quickly identify known genes. SSH 
has been used in this laboratory to begin characterization of the short-term gene
expression profiles of enzyme-inducers such as phenobarbital (Rockett et al. 1997)
and Wy-14,643 (Rockett et al. unpublished observations). The isolation of
differentially expressed genes in this manner enables the construction of a fingerprint
Figure 5. PCR-select cDNA subtraction. In the primary hybridization, an excess of driver cDNA is
added to each tester cDNA population. The samples are heat denatured and allowed to hybridize
for between 3 and 8 h. This serves two purposes: (1) to equalize rare and abundant molecules; and
(2) to enrich for differentially expressed sequences; cDNAs that are not differentially expressed
form type c molecules with the driver. In the secondary hybridization, the two primary
hybridizations are mixed together without denaturing. Fresh denatured driver can also be added
at this point to allow further enrichment of differentially expressed sequences. Type e molecules
are formed in this secondary hybridization which are subsequently amplified using two rounds of
PCR. The final products can be visualized on an agarose gel, labelled directly or cloned into a
vector for downstream manipulation. As described by Diatchenko et al. (1996) and Gurskaya
et al. (1996), with permission.
Figure 6. Flow diagram showing method used in this laboratory to isolate and identify clones of genes
which are differentially expressed in rat liver following short term exposure to the enzyme 
inducers, phenobarbital and Wy-14,643. 



of expressed genes which are unique to each compound and time/dose point. Such 
information could be useful in short-term characterization of the toxic potential of 
new compounds by comparing the gene-expression profiles they elicit with those 
produced by known inducers. Figure 6 shows a flow diagram of the method used to 
isolate, verify and clone differentially expressed genes, and figure 7 shows expression 
profiles obtained from a typical SSH experiment. Subsequent sub-cloning of the
individual bands, sequencing and gene data base interrogation reveal many genes
which are either up- or down-regulated by phenobarbital in the rat (tables 2 and 3).

One of the advantages in using the SSH approach is that no prior knowledge is 
required of which specific genes are up/down-regulated subsequent to xenobiotic 
Figure 7. SSH display patterns obtained from rat liver following 3-day treatment with Wy-14,643 or
phenobarbital. mRNA extracted from control and treated livers was used to generate the
differential displays using the PCR-Select cDNA subtraction kit (Clontech). Lane: 1, 1 kb
ladder; 2, genes upregulated following Wy-14,643 treatment; 3, genes downregulated following
Wy-14,643 treatment; 4, genes upregulated following phenobarbital treatment; 5, genes
downregulated following phenobarbital treatment; 6, 1 kb ladder. Reproduced from Rockett et
al. (1997), with permission.

exposure, and an almost complete complement of genes are obtained. For example,
the peroxisome proliferator and non-genotoxic hepatocarcinogen Wy-14,643
up-regulates at least 28 genes and down-regulates at least 15 in the rat (a sensitive
species) and produces 48 up- and 37 down-regulated genes in the guinea pig, a
resistant species (Rockett, Swales, Esda and Gibson, unpublished observations).
One of these genes, CD81, was up-regulated in the rat and down-regulated in the 
guinea pig following Wy-14,643 treatment. CD81 (alternatively named TAPA-1) is 
a widely expressed cell surface protein which is involved in a large number of cellular 
processes including adhesion, activation, proliferation and differentiation (Levy et 
al. 1998). Since all of these functions are altered to some extent in the phenomena 
of hepatomegaly and non-genotoxic hepatocarcinogenesis, it is intriguing, and 
probably mechanistically-relevant, that CD81 expression is differentially regulated 
in a resistant and susceptible species. However, the down-side of this approach is 
that the majority of genes can be sequenced and matched to database sequences, but 
the latter are predominantly expressed sequence tags or genes of completely 
unknown function, thus partially obscuring a realistic overall assessment of the 
critical genes of genuine biological interest. Notwithstanding the lack of complete
functional identification of altered gene expression, such gene profiling studies
essentially provide a 'molecular fingerprint' in response to xenobiotic challenge,
thereby serving as a mechanistically-relevant platform for further detailed
investigations.

Differential Display (DD) 

Originally described as 'RNA fingerprinting by arbitrarily primed PCR' (Liang
and Pardee 1992), this method is now more commonly referred to as 'differential
Table 2. Genes up-regulated in rat liver by phenobarbital.

Band number
(approximate     Highest sequence
size in bp)      similarity           FASTA-EMBL gene identification

5 (1300)         93.5%                CYP2B1
7 (1000)         95.1%                Preproalbumin
                                      Serum albumin mRNA
8 (950)          98.3%                NCI-CGAP-Pr1 H. sapiens (EST)
10 (850)         95.7%                CYP2B1
11 (800)         Clone 1  94.9%       CYP2B1
                 Clone 2  75.3%       CYP2B2
12 (750)         93.8%                TRPM-2 mRNA
                                      Sulfated glycoprotein
15 (600)         92.9%                Preproalbumin
                                      Serum albumin mRNA
16 (55)          Clone 1  95.2%       CYP2B1
                 Clone 2  93.6%       Haptoglobulin mRNA partial alpha
21 (350)         99.3%                18S, 5.8S & 28S rRNA

Bands 1-4, 6, 9, 13, 14, and 17-20 were shown to be false positives by dot blot analysis and, therefore,
were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes do not
represent the complete spectrum of genes which are up-regulated in rat liver by phenobarbital, but
simply represent the genes sequenced and identified to date.



Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital.

Band number
(approximate     Highest sequence
size in bp)      similarity           FASTA-EMBL gene identification

1 (1500)         95.3%                3-oxoacyl-CoA thiolase
2 (1200)         92.3%                Hemopexin mRNA
3 (1000)         91.7%                Alpha-2u-globulin mRNA
7 (700)          Clone 1  77.2%       M. musculus C1 inhibitor
                 Clone 2  94.5%       Electron transfer flavoprotein
                 Clone 3  91.0%       M. musculus Topoisomerase 1 (Topo 1)
8 (650)          Clone 1  86.9%       Soares 2NbMT M. musculus (EST)
                 Clone 2  96.2%       Alpha-2u-globulin (s-type) mRNA
9 (600)          Clone 1  86.9%       Soares mouse NML M. musculus (EST)
                 Clone 2  82.0%       Soares p3NMF19.5 M. musculus (EST)
10 (550)         73.8%                Soares mouse NML M. musculus (EST)
11 (525)         95.7%                NCI-CGAP-Pr1 H. sapiens (EST)
12 (375)         100.0%               Ribosomal protein
13 (23)          Clone 1  97.2%       Soares mouse embryo NbME135 (EST)
                 Clone 2  100.0%      Fibrinogen B-beta-chain
                 Clone 3  100.0%      Apolipoprotein E gene
14 (170)         96.0%                Soares p3NMF19.5 M. musculus (EST)
15 (140)         97.3%                Stratagene mouse testis (EST)
Others: (300)    96.7%                R. norvegicus RASP 1 mRNA
        (275)    93.1%                Soares mouse mammary gland (EST)

EST = Expressed sequence tag. Bands 4-6 were shown to be false positives by dot blot analysis and,
therefore, were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes
do not represent the complete spectrum of genes which are down-regulated in rat liver by phenobarbital,
but simply represent the genes sequenced and identified to date.



display' (DD). In this method, all the mRNA species in the control and treated cell
populations are amplified in separate reactions using reverse transcriptase-PCR
(RT-PCR). The products are then run side-by-side on sequencing gels. Those
bands which are present in one display only, or which are much more intense in one
display compared to the other, are differentially expressed and may be recovered for
further characterization. One advantage of this system is the speed with which it can
be carried out: 2 days to obtain a display and as little as a week to make and identify
clones.

Two commonly used variations are based on different methods of priming the
reverse transcription step (figure 8). One is to use an oligo dT with a 2-base 'anchor'
at the 3'-end, e.g. 5'-(dT)nCA-3' (Liang and Pardee 1992). Alternatively, an
arbitrary primer may be used for 1st strand cDNA synthesis (Welsh et al. 1992).
This variant of RNA fingerprinting has also been called 'RAP' (RNA Arbitrarily
Primed)-PCR. One advantage of this second approach is that PCR products may be
derived from anywhere in the RNA, including open reading frames. In addition, it
can be used for mRNAs that are not polyadenylated, such as many bacterial mRNAs
(Wong and McClelland 1994). In both cases, following reverse transcription and
denaturation, second strand cDNA synthesis is carried out with an arbitrary primer
(arbitrary primers have a single base at each position, as compared to random
primers, which contain a mixture of all four bases at each position). The resulting
PCR, thus, produces a series of products which, depending on the system (primer
length and composition, polymerase and gel system), usually includes 50-100
products per primer set (Band and Sager 1989). When a combination of different
dT-anchors and arbitrary primers is used, almost all mRNA species from a cell can
be amplified. When the cDNA products from two different populations are analysed
side by side on a polyacrylamide gel, differences in expression can be identified and
the appropriate bands recovered for cloning and further analysis.

Although DD is perhaps the most popular approach used today for identifying 
differentially expressed genes, it does suffer from several perceived disadvantages: 

(1) It may have a strong bias towards high copy number mRNAs (Bertioli et al.
1995), although this has been disputed (Wan et al. 1996) and the isolation of very
low abundance genes may be achieved in certain circumstances (Guimeraes et
al. 1995a).

(2) The cDNAs obtained often only represent the extreme 3' end of the mRNA
(often the 3'-untranslated region), although this may not always be the case
(Guimeraes et al. 1995a). Since the 3' end is often not included in GenBank and
shows variation between organisms, cDNAs identified by DD cannot always be
matched with their genes, even if they have been identified.

(3) The pattern of differential expression seen on the display often cannot be
reproduced on Northern blots, with false positives arising in up to 70% of cases
(Sun et al. 1994). Some adaptations have been shown to reduce false positives,
including the use of two reverse transcriptases (Sung and Denman 1997),
comparison of uninduced and induced cells over a time course (Burn et al. 1994)
and comparison of DDPCR-products from two uninduced and two induced
lines (Sompayrac et al. 1995). The latter authors also reported that the use of
cytoplasmic RNA rather than total RNA reduces false positives arising from
nuclear RNA that is not transported to the cytoplasm.

Further details of the background, strengths and weaknesses of the DD 
technique can be obtained from a review by McClelland et al. (1996) and from 
articles by Liang et al. (1995) and Wan et al. (1996). 
Figure 8. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out
either with a poly(dT)nNN primer (where N = G, C or A) or with an arbitrary primer. The use of
different combinations of G, C and A to anchor the first strand polydT primer enables the priming
of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at none, one or more
places along the length of the mRNA, allowing 1st strand cDNA synthesis to occur at none, one
or more points in the same gene. In both cases, 2nd strand synthesis is carried out with an arbitrary
primer. Since these arbitrary primers for the 2nd strand may also hybridize to the 1st strand cDNA
in a number of different places, several different 2nd strand products may be obtained from one
binding point of the 1st strand primer. Following 2nd strand synthesis, the original set of primers
is used to amplify the second strand products, with the result that numerous gene sequences are
amplified.



Restriction endonuclease-facilitated analysis of gene expression 

Serial Analysis of Gene Expression (SAGE) 

A more recent development in the field of differential display is SAGE analysis
(Velculescu et al. 1995). This method uses a different approach to those discussed so
far and is based on two principles. Firstly, in more than 95% of cases, short
nucleotide sequences ('tags') of only nine or 10 base pairs provide sufficient
information to identify their gene of origin. Secondly, concatenation (linking
together in a series) of these tags allows sequencing of multiple cDNAs within a
single clone. Figure 9 shows a schematic representation of the SAGE process. In this
procedure, double stranded cDNA from the test cells is synthesized with a
biotinylated polydT primer. Following digestion with a commonly cutting (4 bp
recognition sequence) restriction enzyme ('anchoring enzyme'), the 3' ends of the
cDNA population are captured with streptavidin beads. The captured population is





split into two and different adaptors ligated to the 5' ends of each group. Incorporated
into the adaptors is a recognition sequence for a type IIS restriction enzyme, one
which cuts DNA at a defined distance (< 20 bp) from its recognition sequence.
Hence, following digestion of each captured cDNA population with the IIS enzyme,
the adaptors plus a short piece of the captured cDNA are released. The two
populations are then ligated and the products amplified. The amplified products are
cleaved with the original anchoring enzyme, religated (concatemers are formed in
the process) and cloned. The advantage of this system is that hundreds of gene tags
can be identified by sequencing only a few clones. Furthermore, the number of times
a given transcript is identified is a quantitative measurement of that gene's
abundance in the original population, a feature which facilitates identification of
differentially expressed genes in different cell populations.
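
The counting step that makes SAGE quantitative can be sketched in a few lines of Python. The example below is a simplified illustration rather than the published analysis procedure: it assumes a fixed tag length of 10 bp, assumes that the anchoring enzyme site is CATG (a choice made here for illustration and not stated in the text above), and assumes that each ditag consists of one tag followed by the reverse complement of a second tag. Tags that happen to contain the anchoring site, and ditags of unexpected length, are simply skipped. All names are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumption: recognition site of the anchoring enzyme
TAG_LEN = 10          # assumption: fixed tag length in base pairs

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def possible_tags(length):
    """Number of distinct sequences of the given length."""
    return 4 ** length


def extract_tags(concatemer):
    """Split a sequenced concatemer at the anchoring site and return the
    two tags contained in each well-formed ditag."""
    tags = []
    for ditag in concatemer.split(ANCHOR_SITE):
        if len(ditag) != 2 * TAG_LEN:
            continue  # skip flanking sequence and malformed ditags
        tags.append(ditag[:TAG_LEN])
        tags.append(revcomp(ditag[TAG_LEN:]))
    return tags


def tag_counts(concatemers):
    """Count how often each tag occurs across all sequenced clones."""
    counts = Counter()
    for concatemer in concatemers:
        counts.update(extract_tags(concatemer))
    return counts
```

Comparing the counts returned by `tag_counts` for two cell populations gives a direct, if crude, estimate of relative transcript abundance. A 9 bp tag can take 4^9 = 262144 distinct values, which gives a sense of why such short sequences can usually identify their gene of origin.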

Some disadvantages of SAGE analysis include the technical difficulty of the
method, the large amount of accurate sequencing required and a bias towards abundant
mRNAs. In addition, the method has not been validated in the pharmaco/toxicogenomic
setting and has only been used to examine well known tissue differences to date.

Gene Expression Fingerprinting (GEF)

A different capture/restriction digest approach for isolating differentially 
expressed genes has been described by Ivanova and Belyavsky (1995). In this 
method, RNA is converted to cDNA using biotinylated oligo(dT) primers. The 
cDNA population is then digested with a specific endonuclease and captured with 
magnetic streptavidin microbeads to facilitate removal of the unwanted 5' digestion 
products. The use of restricted 3'-ends alone serves to reduce the complexity of the
cDNA fragment pool and helps to ensure that each RNA species is represented by 
not more than one restriction product. An adaptor is ligated to facilitate subsequent 
amplification of the captured population. PCR is carried out with one adaptor- 
specific and one biotinylated polydT primer. The reamplified population is 
recaptured and the non-biotinylated strands removed by alkaline dissociation. The 
non-biotinylated strand is then resynthesized using a different adaptor-specific 
primer in the presence of a radiolabeled dNTP. The labelled immobilized 3' cDNA
ends are next sequentially treated with a series of different restriction endonucleases 
and the products from each digestion analysed by PAGE. The result is a fingerprint 
composed of a number of ladders (equal to the number of sequential digests used). 
By comparing test versus control fingerprints, it is possible to identify differentially 
expressed products which can then be isolated from the gel and cloned. The 
advantages of this procedure are that it is very robust and reproducible, and the 
authors estimate that 80-93% of cDNA molecules are involved in the final 
fingerprint. The disadvantage is that polyacrylamide gels can rarely resolve more 
than 300-400 bands, which compares poorly to the 1000 or more which are 
estimated to be produced in an average experiment. The use of 2-D gels such as 
those described by Uitterlinden et al. (1989) and Hatada et al. (1991) may help to 
overcome this problem. 

A similar method for displaying restriction endonuclease fragments was later 
described by Prashar and Weissman (1996). However, instead of sequential 
digestion of the immobilized 3'-terminal cDNA fragments, these authors simply 
compared the profiles of the control and treated populations without further 
manipulation. 

Figure 9. Serial analysis of gene expression (SAGE) analysis. cDNA is cleaved with an anchoring enzyme 
(AE) and the 3'ends captured using streptavidin beads. The cDNA pool is divided in half and each 
portion ligated to a different linker, each containing a type IIS restriction site (tagging enzyme, 
TE). Restriction with the type IIS enzyme releases the linker plus a short length of cDNA 
(XXXXX and OOOOO indicate nucleotides of different tags). The two pools of tags are then 
ligated and amplified using linker-specific primers. Following PCR, the products are cleaved with 
the AE and the ditags isolated from the linkers using PAGE. The ditags are then ligated (during 
which process, concatenization occurs) and cloned into a vector of choice for sequencing. After 
Velculescu et al. (1995), with permission. 

DNA arrays 

'Open' differential display systems are cumbersome in that it takes a great deal 
of time to extract and identify candidate genes and then confirm that they are indeed 
up- or down-regulated in the treated compared to the control tissue. Normally, the 
latter process is carried out using Northern blotting or RT-PCR. Even so, each of 
the aforementioned steps produces a bottleneck to the ultimate goal of rapid analysis 
of gene expression. These problems will likely be addressed by the development of 
so-called DNA arrays (e.g. Gress et al. 1992, Zhao et al. 1995, Schena et al. 1996), 
the introduction of which has signalled the next era in differential gene expression 
analysis. DNA arrays consist of a gridded membrane or glass 'chips' containing 
hundreds or thousands of DNA spots, each consisting of multiple copies of part of 
a known gene. The genes are often selected based on previously proven involvement 
in oncogenesis, cell cycling, DNA repair, development and other cellular processes. 
They are usually chosen to be as specific as possible for each gene and animal species. 
Human and mouse arrays are already commercially available and a few companies 
will construct a personalized array to order, for example Clontech Laboratories and 
Research Genetics Inc. The technique is rapid in that hundreds or even thousands 
of genes can be spotted on a single array, and that mRNA/cDNA from the test 
populations can be labelled and used directly as probe. When analysed with 
appropriate hardware and software, arrays offer a rapid and quantitative means to 
assess differences in gene expression between two cell populations. Of course, there 
can only be identification and quantitation of those genes which are in the array 
(hence the term 'closed' system). Therefore, one approach to elucidating the 
molecular mechanisms involved in a particular disease/development system may be 
to combine an open and closed system: a DNA array, to directly identify and 
quantitate the expression of known genes in mRNA populations, and an open 
system such as SSH to isolate unknown genes which are differentially expressed. 

One of the main advantages of DNA arrays is the huge number of gene fragments 
which can be put on a membrane; some companies have reported gridding up to 
60 000 spots on a single glass 'chip' (microscope slide). These high density chip-based 
micro-arrays will probably become available as mass-produced off-the-shelf 
items in the near future. This should facilitate the more rapid determination of 
differential expression in time and dose-response experiments. Aside from their 
high cost and the technical complexities involved in producing and probing DNA 
arrays, the main problem which remains, especially with the newer micro-array 
(gene-chip) technologies, is that results are often not wholly reproducible between 
arrays. However, this problem is being addressed and should be resolved within the 
next few years. 



EST databases as a means to identify differentially expressed genes 

Expressed sequence tags (ESTs) are partial sequences of clones obtained from 
cDNA libraries. Even though most ESTs have no formal identity (putative 
identification is the best to be hoped for), they have proven to be a rapid and efficient 
means of discovering new genes and can be used to generate profiles of gene 
expression in specific cells. Since they were first described by Adams et al. (1991), 
there has been a huge explosion in EST production and it is estimated that there are 
now well over a million such sequences in the public domain, representing over half 
of all human genes (Hillier et al. 1996). This large number of freely available 
sequences (both sequence information and clones are normally available royalty-free 
from the originators) has enabled the development of a new approach towards 
differential gene expression analysis as described by Vasmatzis et al. (1998). The 
approach is simple in theory: EST databases are first searched for genes that have a 
number of related EST sequences from the target tissue of choice, but none or few 
from non-target tissue libraries. Programmes to assist in the assembly of such sets of 
overlapping data may be developed in-house or obtained privately or from the 
internet. For example, the Institute for Genomic Research (TIGR, found at 
http://www.tigr.org) provides many software tools free of charge to the scientific 
community. Included amongst these is the TIGR assembler (Sutton et al. 1995), a 
tool for the assembly of large sets of overlapping data such as ESTs, bacterial 
artificial chromosomes (BACs), or small genomes. Candidate EST clones representing 
different genes are then analysed using RNA blot methods for size and tissue 
specificity and, if required, used as probes to isolate and identify the full length 
cDNA clone for further characterization. In practice, however, the method is rather 
more involved, requiring bioinformatic and computer analysis coupled with 
confirmatory molecular studies. Vasmatzis et al. (1998) have described several 
problems in this fledgling approach, such as separating highly homologous 
sequences derived from different genes and an overemphasis of specificity for some 
EST sequences. However, since these problems will largely be addressed by the 
development of more suitable computer algorithms and an increased completeness 
of the EST database, it is likely that this approach to identifying differentially 
expressed genes may enjoy more patronage in the future. 
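
The database search step described above (many ESTs from the target tissue, none or few from other tissues) can be sketched as a simple counting filter. This is a hedged illustration, not the procedure of Vasmatzis et al. (1998): the record layout, the function name and the two thresholds are hypothetical, and a real analysis would first need the ESTs assembled into gene clusters.

```python
from collections import defaultdict


def tissue_enriched_genes(est_records, target_tissue, min_target=3, max_other=1):
    """Return gene clusters with many target-tissue ESTs and few others.

    est_records is an iterable of (gene_cluster, tissue) pairs, one per EST.
    min_target and max_other are illustrative thresholds only.
    """
    target_counts = defaultdict(int)
    other_counts = defaultdict(int)
    for gene, tissue in est_records:
        if tissue == target_tissue:
            target_counts[gene] += 1
        else:
            other_counts[gene] += 1
    return sorted(
        gene
        for gene, n in target_counts.items()
        if n >= min_target and other_counts[gene] <= max_other
    )


# Hypothetical EST records: geneA is seen only in the target tissue,
# geneB is seen widely, geneC has too few ESTs to judge.
records = (
    [("geneA", "liver")] * 4
    + [("geneB", "liver")] * 4
    + [("geneB", "kidney")] * 5
    + [("geneC", "liver")]
)
candidates = tissue_enriched_genes(records, "liver")
```

Candidates returned by such a filter would still require the confirmatory molecular studies described in the text.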



Problems and potential of differential expression techniques 

The holistic or single cell approach? 

When working with in vivo models of differential expression, one of the first 
issues to consider must be the presence of multiple cell types in any given specimen. 
For example, a liver sample is likely to contain not only hepatocytes, but also 
(potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g. 
lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues will 
each have their own distinctive cell populations. Also, in the case of neoplastic tissue, 
there are almost always normal, hyperplastic and/or dysplastic cells present in a 
sample. One must, therefore, be aware that genes obtained from a differential 
display experiment performed on an animal tissue model may not necessarily arise 
exclusively from the intended 'target' cells, e.g. hepatocytes/neoplastic cells. If 
appropriate, further analyses using immunohistochemistry, in situ hybridization or 
in situ RT-PCR should be used to confirm which cell types are expressing the 
gene(s) of interest. This problem is probably most acute for those studying the 
differential expression of genes in the development of different cell types, where 
there is a need to examine homologous cell populations. The problem is now being 
addressed at the National Cancer Institute (Bethesda, MD, USA) where new 
microdissection techniques have been employed to assist in their gene analysis programme, 
the Cancer Genome Anatomy Project (CGAP) (for more information see web site: 
http://www.ncbi.nlm.nih.gov/ncicgap/intro.html). There are also separation 
techniques available that utilise cell-specific antigens as a means to isolate target cells, 
e.g. fluorescence activated cell sorting (FACS) (Dunbar et al. 1998, Kas-Deelen et 
al. 1998) and magnetic bead technology (Richard et al. 1998, Rogler et al. 1998). 

However, those taking a holistic approach may consider this issue unimportant. 
There is an equally appropriate view that all those genes showing altered expression 
within a compromised tissue should be taken into consideration. After all, since all 
tissues are complex mixes of different, interacting cell types which intimately 
regulate each other's growth and development, it is clear that each cell type could in 
some way contribute (positively or negatively) towards the molecular mechanisms 
which lie behind responses to external stimuli or neoplastic growth. It is perhaps 
then more informative to carry out differential display experiments using in vivo as 
opposed to in vitro models, where uniform populations of identical cells probably 
represent a partial, skewed or even inaccurate picture of the molecular changes that 
occur. 

The incidence and possible implications of inter-individual biological variation 
should be considered in any approach where whole animal models are being used. It 
is clear that individuals (humans and animals) respond in different ways to identical 
stimuli. One of the best characterized examples is the debrisoquine oxidation 
polymorphism, which is mediated by cytochrome CYP2D6 and determines the 
pharmacokinetics of many commonly prescribed drugs (Lennard 1993, Meyer and 
Zanger 1997). The reasons for such differences are varied and complex, but allelic 
variations, regulatory region polymorphisms and even physical and mental health 
can all contribute to observed differences in individual responses. Careful thought 
should, therefore, be given to the specific objectives of the study and to the possible 
value of pooling starting material (tissue/mRNA). The effect of this can be 
beneficial through the ironing out of exaggerated responses and unimportant minor 
fluctuations of (mechanistically) irrelevant genes in individual animals, thus 
providing a clearer overall picture of the general molecular mechanisms of the 
response. However, at the same time such minor variations may be of utmost 
importance in deciding the ability of individual animals to succumb to or resist the 
effects of a given chemical/disease. 



How efficient are differential expression techniques at recovering a high percentage of 
differentially expressed genes? 

A number of groups have produced experimental data suggesting that mammalian 
cells produce between 8000 and 15 000 different mRNA species at any one time 
(Mechler and Rabbitts 1981, Hedrick et al. 1984, Bravo 1990), although figures as 
high as 20 000-30 000 have also been quoted (Axel et al. 1976). Hedrick et al. (1984) 
provided evidence suggesting that the majority of these belong to the rare abundance 
class. A breakdown of this abundance distribution is shown in table 1. 

When the results of differential display experiments have been compared with 
data obtained previously using other methods, it is apparent that not all differentially 
expressed mRNAs are represented in the final display. In particular, rare messages 
(which, importantly, often include regulatory proteins) are not easily recovered 
using differential display systems. This is a major shortcoming, as the majority of 
mRNA species exist at levels of less than 0.005% of the total population (table 1). 
Bertioli et al. (1995) examined the efficiency of DD templates (heterogeneous 
mRNA populations) for recovering rare messages and were unable to detect mRNA 
species present at less than 1.2% of the total mRNA population, equivalent to an 
intermediate or abundant species. Interestingly, when simple model systems (single 
target only) were used instead of a heterogeneous mRNA population, the same 
primers could detect target mRNA at levels up to 10 000-fold lower. These results 
are probably best explained by competition for substrates from the many PCR 
products produced in a DD reaction. 
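
The figures quoted above can be put side by side with a short calculation. This is only a sketch: it assumes that the 10 000-fold improvement is measured relative to the 1.2% limit, which the text does not state explicitly, and the function names are introduced here for illustration.

```python
# Detection limit reported by Bertioli et al. (1995) for heterogeneous
# templates, as a percentage of the total mRNA population.
HETEROGENEOUS_LIMIT_PERCENT = 1.2
# Upper bound quoted in the text for the abundance of most mRNA species.
RARE_SPECIES_PERCENT = 0.005
# Improvement reported for single-target model systems (assumed here to be
# relative to the heterogeneous limit).
FOLD_IMPROVEMENT = 10_000


def single_target_limit(limit_percent=HETEROGENEOUS_LIMIT_PERCENT,
                        fold=FOLD_IMPROVEMENT):
    """Detection limit (%) if sensitivity improves by the given fold."""
    return limit_percent / fold


def is_detectable(abundance_percent, limit_percent):
    """True if a species is at or above the detection limit."""
    return abundance_percent >= limit_percent


def shortfall_fold(abundance_percent=RARE_SPECIES_PERCENT,
                   limit_percent=HETEROGENEOUS_LIMIT_PERCENT):
    """How many fold a species falls below the detection limit."""
    return limit_percent / abundance_percent
```

On these assumptions a species at 0.005% lies roughly 240-fold below the heterogeneous-template limit, but above the limit of about 0.00012% implied for a single-target system.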

The numbers of differentially expressed mRNAs reported in the literature using 
various model systems provide further evidence that many differentially expressed 
mRNAs are not recovered. For example, DeRisi et al. (1997) used DNA array 
technology to examine gene expression in yeast following exhaustion of sugar in the 
medium, and found that more than 1700 genes showed a change in expression of at 
least 2-fold. In light of such a finding, it would not be unreasonable to suggest that 
of the 8000-15 000 different mRNA species produced by any given mammalian cell, 
up to 1000 or more may show altered expression following chemical stimulation. 
Whilst this may be an extreme figure, it is known that at least 100 genes are 
activated/upregulated in Jurkat (T-) cells following IL-2 stimulation (Ullman et al. 
1990). In addition, Wan et al. (1996) estimated that interferon-γ-stimulated HeLa 
cells differentially express up to 433 genes (assuming 24000 distinct mRNAs 
expressed by the cells). However, there have been few publications documenting 
anywhere near the recovery of these numbers. For example, in using DD to compare 
normal and regenerating mouse liver, Bauer et al. (1993) found only 70 of 38000 
total bands to be different. Of these, 50% (35 genes) were shown to correspond to 
differentially expressed bands. Chen et al. (1996) reported 10 genes upregulated in 
female rat liver following ethinyl estradiol treatment. McKenzie and Drake (1997) 
identified 14 different gene products whose expression was altered by phorbol 
myristate acetate (PMA, a tumour promoter agent) stimulation of a human 
myelomonocytic cell line. Kilty and Vickers (1997) identified 10 different gene 
products whose expression was upregulated in the peripheral blood leukocytes of 
allergic disease sufferers. Linskens et al. (1995) found 23 genes differentially 
expressed between young and senescent fibroblasts. Techniques other than DD 
have also provided an apparent paucity of differentially expressed genes. Using SH 
for example, Cao et al. (1997) found 15 genes differentially expressed in colorectal 
cancer compared to normal mucosal epithelium. Fitzpatrick et al. (1995) isolated 17 
genes upregulated in rat liver following treatment with the peroxisome proliferator, 
clofibrate; Philips et al. (1990) isolated 12 cDNA clones which were upregulated in 
highly metastatic mammary adenocarcinoma cell lines compared to poorly 
metastatic ones. Prashar and Weissman (1996) used 3' restriction fragment analysis and 
identified approximately 40 genes showing altered expression within 4 h of 
activation of Jurkat T-cells. Groenink and Leegwater (1996) analysed 27 gene 
fragments isolated using SSH of delayed early response phase of liver regeneration 
and found only 12 to be upregulated. 

In this laboratory, SSH was used to isolate up to 70 candidate genes which appear 
to show altered expression in guinea pig liver following short-term treatment with 
the peroxisome proliferator, WY-14,643 (Rockett, Swales, Esdaile and Gibson, 
unpublished observations). However, these findings have still to be confirmed by 
analysis of the extracted tissue mRNA for differential expression of these sequences. 

Whilst the latest differential display technologies are purported to include design 
and experimental modifications to overcome this lack of efficiency (in both the total 
number of differentially expressed genes recovered and the percentage that are true 
positives), it is still not clear if such adaptations are practically effective: proving 
efficiency by spiking with a known amount of limited numbers of artificial 
construct(s) is one thing, but isolating a high percentage of the rare messages already 
present in an mRNA population is another. Of course, some models will genuinely 
produce only a small number of differentially expressed genes. In addition, there are 
also technical problems that can reduce efficiency. For example, mRNAs may have 
an unusual primary structure that effectively prevents their amplification by PCR- 
based systems. In addition, it is known that under certain circumstances not all 
mRNAs have 3' poly A sites. For example, during Xenopus development, deadenylation 
is used as a means to stabilize RNAs (Voeltz and Steitz 1998), whilst 
preferential deadenylation may play a role in regulating Hsp70 (and perhaps, 
therefore, other stress protein) expression in Drosophila (Dellavalle et al. 1994). The 
presence of deadenylated mRNAs would clearly reduce the efficiency of systems 
utilizing a polydT reverse transcription step. The efficiency of any system also 
depends on the quality of the starting material. All differential display techniques 
use mRNA as their target material. However, it is difficult to isolate mRNA that is 
completely free of ribosomal RNA. Even if polydT primers are used to prime first 
strand cDNA synthesis, ribosomal RNA is often transcribed to some degree 
(Clontech PCR-Select cDNA Subtraction kit user manual). It has been shown, at 
least in the case of SSH, that a high rRNA:mRNA ratio can lead to inefficient 
subtractive hybridization (Clontech PCR-Select cDNA Subtraction kit user 
manual), and there is no reason to suppose that it will not do likewise in other SH 
approaches. Finally, those techniques that utilise a presubtraction amplification step 
(e.g. RDA) may present a skewed representation since some sequences amplify 
better than others. 

Of course, probably the most important consideration is the temporal factor. It 
is clear that any given differential display experiment can only interrogate a cell at 
one point in time. It may well be that a high percentage of the genes showing altered 
expression at that time are obtained. However, given that disease processes and 
responses to environmental stimuli involve dynamic cascades of signalling, 
regulation, production and action, it is clear that all those genes which are switched 
on/off at different times will not be recovered and, therefore, vital information may 
well be missed. It is, therefore, imperative to obtain as much information about the 
model system beforehand as possible, from which a strategy can be derived for 
targeting specific time points or events that are of particular interest to the 
investigator. One way of getting round this problem of single time point analysis is 
to conduct the experiment over a suitable time course which, of course, adds 
substantially to the amount of work involved. 



How sensitive are differential expression technologies? 

There has been little published data that addresses the issue of how large the 
change in expression must be for it to permit isolation of the gene in question with 
the various differential expression technologies. Although the isolation of genes 
whose expression is changed as little as 1.5-fold has been reported using SSH 
(Groenink and Leegwater 1996), it appears that those demonstrating a change in 
excess of 5-fold are more likely to be picked up. Thus, there is a 'grey zone' 
in between where small changes could fade in and out of isolation between 
experiments and animals. DD, on the other hand, is not subject to this grey 
zone since, unlike SH approaches, it does not amplify the difference in expression 
between two samples. Wan et al. (1996) reported that differences in expression of 
twofold or more are detectable using DD. 
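
The 'grey zone' described above for SSH can be expressed as a simple classification by fold change. This is a sketch only: the 1.5-fold and 5-fold boundaries are taken from the figures quoted in the text, but treating them as sharp cut-offs, and the function name itself, are assumptions made for illustration.

```python
def ssh_recovery_zone(control_level, treated_level, low=1.5, high=5.0):
    """Classify an expression change by how likely SSH is to recover it.

    Levels are arbitrary positive expression measurements. The direction of
    the change is ignored; only its magnitude is considered.
    """
    if control_level <= 0 or treated_level <= 0:
        raise ValueError("expression levels must be positive")
    fold = max(control_level, treated_level) / min(control_level, treated_level)
    if fold < low:
        return "unlikely"
    if fold <= high:
        return "grey zone"
    return "likely"


# Hypothetical measurements for three genes.
zones = [
    ssh_recovery_zone(10, 12),  # 1.2-fold change
    ssh_recovery_zone(10, 30),  # 3-fold change
    ssh_recovery_zone(60, 10),  # 6-fold change (downregulated)
]
```

Genes falling in the grey zone are those that might be isolated in one experiment or animal but missed in another.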

Resolution and visualization of differential expression products 

It seems highly improbable with current technology that a gel system could be 
developed that is able to resolve all gene species showing altered expression in any 
given test system (be it SH- or DD-based). Polyacrylamide gel electrophoresis 
(PAGE) can resolve size differences down to 0.2% (Sambrook et al. 1989) and is 
used as standard in DD experiments. Even so, it is clear that a complex series of gene 
products such as those seen in a DD will contain unresolvable components. Thus, 
what appears to be one band in a gel may in fact turn out to be several. Indeed, it has 
been well documented (Mathieu-Daude et al. 1996, Smith et al. 1997) that a single 
band extracted from a DD often represents a composite of heterogeneous products, 
and the same has been found for SSH displays in this laboratory (Rockett et al. 
1997). One possible solution was offered by Mathieu-Daude et al. (1996), who 
extracted and reamplified candidate bands from a DD display and used single strand 
conformation polymorphism (SSCP) analysis to confirm which components 
represented the truly differentially expressed product. 

Many scientists often try to avoid the use of PAGE where possible because it is 
technically more demanding than agarose gel electrophoresis (AGE). Unfortunately, 
high resolution agarose gels such as Metaphor (FMC, Lichfield, UK) and AquaPor 
HR (National Diagnostics, Hessle, UK), whilst easier to prepare and manipulate 
than PAGE, can only separate DNA sequences which differ in size by around 
1.5-2% (15-20 base pairs for a 1 kb fragment). Thus, SSH, RDA or other such 
products which differ in size by less than this amount are normally not resolvable. 
However, a simple technique does in fact exist for increasing the resolving power of 
AGE: the inclusion of HA-red (10-phenyl neutral red-PEG ligand) or HA-yellow 
(bisbenzamide-PEG ligand) (Hanse Analytik GmbH, Bremen, Germany) in a 
gel separates identical or closely sized products on base content. Specifically, 
HA-red and -yellow selectively bind to GC and AT DNA motifs, respectively 
(Wawer et al. 1995, Hanse Analytik 1997, personal communication). Since both 
HA-stains possess an overall positive charge, they migrate towards the cathode 
when an electric field is applied. This is in direct opposition to DNA, which 
is negatively charged and, therefore, migrates towards the anode. Thus, if two 
DNA clones are identical in size (as perceived on a standard high resolution 
agarose gel), but differ in AT/GC content, inclusion of a HA-dye in the gel 
will effectively retard the migration of one of the sequences compared to the 
other, effectively making it apparently larger and, thus, providing a means of 
differentiating between the two. The use of HA-red has been shown to resolve 
sequences with an AT variation of less than 1 % (Wawer et al. 1995), whilst Hanse 
Analytik have reported that HA staining is so sensitive that in one case it was used 
to distinguish two 567bp sequences which differed by only a single point mutation 
(Hanse Analytik 1996, personal communication). Therefore, if one wishes to check 
whether all the clones produced from a specific band in a differential display 
experiment are derived from the same gene species, a small amount of reamplified 
or digested clone can be run on a standard high resolution gel, and a second aliquot 
Figure 10. Discrimination of clones of identical/nearly identical size using HA-red. Bands of decreasing 
size (1-5) were extracted from the final display of a suppression subtractive hybridization 
experiment and cloned. Seven colonies were picked at random from each cloned band and their 
inserts amplified using PCR. The products were run on two gels, (A) a high resolution 2% agarose 
gel, and (B) a high resolution 2% agarose gel containing 1 U/ml HA-red. With few exceptions, all 
the clones from each band appear to be the same size (gel A). However, the presence of HA-red 
(gel B), which separates identically-sized DNA fragments based on the percentage of GC within 
the sequence, clearly indicates the presence of different gene species within each band. For 
example, even though all five re-amplified clones of band 1 appear to be the same size, at least four 
different gene species are represented. 

in a similar gel containing one of the HA-stains. The standard gel should indicate 
any gross size differences, whilst the HA-stained gel should separate otherwise 
unresolvable species (on standard AGE) according to their base content. Geisinger 
et al. (1997) reported successful use of this approach for identifying DD-derived 
clones. Figure 10 shows such an experiment carried out in this laboratory on clones 
obtained from a band extracted from an SSH display. 
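
Where clone sequences are available, the reasoning behind the two-gel comparison can be sketched computationally: two products are first compared by size and, if they cannot be separated on that basis, by base content. This is an illustration under stated assumptions. The 1.5% size resolution is the figure quoted above for high resolution agarose; the minimum GC difference of 1% is a hypothetical placeholder, not a measured property of the HA-stains.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def how_to_resolve(seq_a, seq_b, size_resolution=0.015, min_gc_difference=0.01):
    """Suggest which property could separate two products on a gel.

    size_resolution is the fractional size difference resolvable by agarose;
    min_gc_difference is a hypothetical threshold for a base-content gel.
    """
    longer = max(len(seq_a), len(seq_b))
    if abs(len(seq_a) - len(seq_b)) / longer >= size_resolution:
        return "size"
    if abs(gc_fraction(seq_a) - gc_fraction(seq_b)) >= min_gc_difference:
        return "base content"
    return "neither"


# Hypothetical 100 bp clones of identical size but different base content.
at_rich_clone = "AT" * 50
gc_rich_clone = "GC" * 50
suggestion = how_to_resolve(at_rich_clone, gc_rich_clone)
```

A result of "neither" corresponds to products of the same size and base content, for which some other method of resolution would be needed.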

An alternative approach is to carry out a 2-D analysis of the differential display 
products. In this approach, size-based separation is first carried out in a standard 
agarose gel. The gel slice containing the display is then extracted and incorporated 
into an HA gel for resolution based on AT/GC content. 

Of course, one should always consider the possibility of there being different 
gene species which are the same size and have the same GC/AT content. However, 
even these species are not unresolvable given some effort; again, one might use 
SSCP, or perhaps a denaturing gradient gel electrophoresis (DGGE) or temperature 
gradient gel electrophoresis (TGGE) approach to resolve the contents of a band, 
either directly on the extracted band (Suzuki et al. 1991) or on the reamplified 
product. 

The requirement of some differential display techniques to visualize large 
numbers of products (e.g. DD and GEF) can also present a problem in that, in terms 
of numbers, the resolution of PAGE rarely exceeds 300-400 bands. One approach to 
overcoming this might be to use 2-D gels such as those described by Uitterlinden et 
al. (1989) and Hatada et al. (1991). 

Extraction of differentially expressed bands from a gel can be complex since, in 
some cases (e.g. DD, GEF), the results are visualized by autoradiographic means, 
such that precise overlay of the developed film on the gel must occur if the correct 
band is to be extracted for further analysis. Clearly, a misjudged extraction can 
account for many man-hours lost. This problem, and that of the use of radioisotopes, 
has been addressed by several groups. For example, Lohmann et al. (1995) 
demonstrated that silver staining can be used directly to visualize DD bands in 
horizontal PAGs. An et al. (1996) avoided the use of radioisotopes by transferring a 
small amount (20-30%) of the DNA from their DD to a nylon membrane, and 
visualizing the bands using chemiluminescent staining before going back to extract 
the remaining DNA from the gel. Chen and Peck (1996) went one step further and 
transferred the entire DD to a nylon membrane. The DNA bands were then 
visualized using a digoxigenin (DIG) system (DIG was attached to the polydT 
primers used in the differential display procedure). Differentially expressed bands 
were cut from the membrane and the DNA eluted by washing with PCR buffer prior 
to reamplification. 

One of the advantages of using techniques such as SSH and RDA is that the final 
display can be run on an agarose gel and the bands visualized with simple ethidium 
bromide staining. Whilst this approach can provide acceptable results, overstaining 
with SYBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances 
the intensity and sharpness of the bands. This greatly aids in their precise extraction 
and often reveals some faint products that may otherwise be overlooked. Whilst 
differential displays stained with SYBR Green I are better visualized using short 
wavelength UV (254 nm) rather than medium wavelength (306 nm), the shorter 
wavelength is much more DNA damaging. In practice, it takes only a few seconds 
to damage DNA extracted under 254 nm irradiation, effectively preventing 
reamplification and cloning. The best approach is to overstain with SYBR Green I 
and extract bands under a medium wavelength UV transillumination. 

The possible use of 'microfingerprinting' to reduce complexity 

Given the sheer number of gene products and the possible complexity of each 
band, an alternative approach to rapid characterization may be to use an enhanced 
analysis of a small section of a differential display: a 'sub-fingerprint' or 
'micro-fingerprint'. In this case, one could concentrate on those bands which only appear 
in a particular chosen size region. Reducing the fingerprint in this way has at least 
two advantages. One is that it should be possible to use different gel types, 
concentrations and run times tailored exactly to that region. Currently, one might 
run products from 100-3000+ bp on the same gel, which leads to compromise in the 
gel system being used and consequently to suboptimal resolution, both in terms of 
size and numbers, and can lead to problems in the accurate excision of individual 
bands. Secondly, it may be possible to enhance resolution by using a 2-D analysis 
using a HA-stain, as described earlier. In summary, if a range of gene product sizes 
is carefully chosen to include certain 'relevant' genes, the 2-D system standardized, 
and appropriate gene analysis used, it may be possible to develop a method for the 
early and rapid identification of compounds which have similar or widely different 
cellular effects. If the prognosis for exposure to one or more other chemicals which 
display a similar profile is already known, then one could perhaps predict similar 
effects for any new compounds which show a similar micro-fingerprint. 

An alternative approach to microfingerprinting is to examine altered expression 
in specific families of genes through careful selection of PCR primers and/or 
post-reaction analysis. Stress genes, growth factors and/or their receptors, cell cycling 
genes, cytochromes P450 and regulatory proteins might be considered as candidates 
for analysis in this way. Indeed, some off-the-shelf DNA arrays (e.g. Clontech's 
Atlas cDNA Expression Array series) already anticipated this to some degree by 
grouping together genes involved in different responses, e.g. apoptosis, stress, 
DNA-damage response etc. 



Screening 

False positives 

The generation of false positives has been discussed at length amongst the 
differential display community (Liang et al. 1993, 1995, Nishio et al. 1994, Sun et al. 
1994, Sompayrac et al. 1995). The reason for false positives varies with the 
technique being used. For instance, in RDA, the use of adaptors which have not 
been HPLC purified can lead to the production of false positives through illegitimate 
ligation events (O'Neill and Sinclair 1997), whilst in DD they can arise through 
PCR artifacts and illegitimate transcription of rRNA. In SH, false positives appear 
to be derived largely from abundant gene species, although some may arise from 
cDNA/mRNA species which do not undergo hybridization for technical reasons. 

A quick screening of putative differentially expressed clones can be carried out 
using a simple dot blot approach, in which labelled first strand probes synthesized 
from tester and driver mRNA are hybridized to an array of said clones (Hedrick et 
al. 1984, Sakaguchi et al. 1986). Differentially expressed clones will hybridize to 
tester probe, but not driver. The disadvantage of this approach is that rare species 
may not generate detectable hybridization signals. One option for those using SSH 
is to screen the clones using a labelled probe generated from the subtracted cDNA 
from which it was derived, and with a probe made from the reverse subtraction 
reaction (ClonTechniques 1997a). Since the SSH method enriches rare sequences, 
it should be possible to confirm the presence of clones representing low abundance 
genes. Despite this quick screening step, there is still the need to go back to the 
original mRNA and confirm the altered expression using a more quantitative 
approach. Although this may be achieved using Northern blots, the sensitivity is 
poor by today's high standards and one must rely on PCR methods for accurate and 
sensitive determinations (see below). 
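
The dot blot screen described above can be sketched as a simple comparison of tester and driver signals for each clone. This is only an illustration: the function name, the input layout and both thresholds (`min_signal`, `min_ratio`) are assumptions made here, not values from the cited protocols, and any real cut-off would have to be set empirically.

```python
def classify_clones(signals, min_signal=50.0, min_ratio=3.0):
    """Label each clone from its (tester, driver) dot blot intensities.

    min_signal and min_ratio are illustrative thresholds, not published values.
    """
    calls = {}
    for clone, (tester, driver) in signals.items():
        if max(tester, driver) < min_signal:
            # Neither probe gave a usable signal, as expected for rare species.
            calls[clone] = "undetected"
        elif tester >= min_ratio * driver:
            # Hybridizes to the tester probe much more strongly than to driver.
            calls[clone] = "candidate"
        else:
            calls[clone] = "likely false positive"
    return calls


if __name__ == "__main__":
    example = {
        "clone_A": (900.0, 120.0),
        "clone_B": (400.0, 380.0),
        "clone_C": (12.0, 8.0),
    }
    for name, call in classify_clones(example).items():
        print(name, call)
```

Clones labelled "undetected" are the ones for which the text recommends a subtracted-probe screen or a PCR-based follow-up, since a weak signal says nothing about differential expression.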



Sequence analysis 

The majority of differential display procedures produce final products which are 
between 100 and 1000 bp in size. However, this may considerably reduce the size of 
the sequence for analysis of the DNA databases. This in turn leads to a reduced 
confidence in the result — several families of genes have members whose DNA 
sequences are almost identical except in a few key stretches, e.g. the cytochrome 
P450 gene superfamily (Nelson et al. 1996). Thus, does the clone identified as being 
almost identical to gene X0 really come from that gene, or its brother gene X1, or its 
as yet undiscovered sister X2? For example, using SSH, part of a gene was isolated 
which was up-regulated in the liver of rats exposed to Wy-14,643 and was identified 
by a FASTA search as being transferrin (data not shown). However, transferrin is 
known to be downregulated by hypolipidemic peroxisome proliferators such as 
Wy-14,643 (Hertz et al. 1996), and this was confirmed with subsequent RT-PCR 
analysis. This suggests that the gene sequence isolated may belong to a gene which 
is closely related to transferrin, but is regulated by a different mechanism. 

A further problem associated with SH technology is redundancy. In most cases 
before SH is carried out, the cDNA population must first be simplified by restriction 
digestion. This is important for at least two reasons: 

(1) To reduce complexity: long cDNA fragments may form complex networks 
which prevent the formation of appropriate hybrids, especially at the high 
concentrations required for efficient hybridization. 

(2) Cutting the cDNAs into small fragments provides better representation of 
individual genes. This is because genes derived from related but distinct 
members of gene families often have similar coding sequences that may cross- 
hybridize and be eliminated during the subtraction procedure (Ko 1990). 
Furthermore, different fragments from the same cDNA may differ considerably 
in terms of hybridization and amplification and, thus, may not efficiently do one 
or the other (Wang and Brown 1991). Thus, some fragments from differentially 
expressed cDNAs may be eliminated during subtractive hybridization 
procedures. However, other fragments may be enriched and isolated. As a 
consequence of this, some genes will be cut one or more times, giving rise to two 
or more fragments of different sizes. If those same genes are differentially 
expressed, then two or more of the different size fragments may come through 
as separate bands on the final differential display, increasing the observed 
redundancy and increasing the number of redundant sequencing reactions. 

Sequence comparisons also raise another important point: at what degree 
of sequence similarity does one accept a result? Is 90% identity between a gene 
derived from your model species and another acceptably close? Is 95% between 
your sequence and one from the same species also acceptable? This problem is 
particularly relevant when the forward and reverse sequence comparisons give 
similarly strong matches to completely different gene species! An arbitrary decision 
seems to be to allocate genes that are definite (95% and above similarity) and then 
group those between 60 and 95% as being related or possible homologues. 
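
The arbitrary cut-offs just mentioned can be written down as a small sketch. It assumes the two sequences have already been aligned to the same length (with `-` for gaps) by whatever search tool is in use. The function names and the "unassigned" label for hits below 60% are assumptions added here for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != "-"  # a gap never counts as a match
    )
    return 100.0 * matches / len(aligned_a)


def classify_hit(identity):
    """Apply the 95% and 60% thresholds described in the text."""
    if identity >= 95.0:
        return "definite"
    if identity >= 60.0:
        return "related or possible homologue"
    return "unassigned"  # label assumed here; the text does not name this group


if __name__ == "__main__":
    identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")
    print(identity, classify_hit(identity))
```

Running the forward and reverse reads through the same function makes conflicting assignments explicit, although it cannot say which of two closely related family members the clone really derives from.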

Quantitative analysis 

At some point, one must give consideration to the quantitative analysis of the 
candidate genes, either as a means of confirming that they are truly differentially 
expressed, or in order to establish just what the differences are. Northern blot 
analysis is a popular approach as it is relatively easy and quick to perform. However, 
the major drawback with Northern blots is that they are often not sensitive enough 
to detect rare sequences. Since the majority of messages expressed in a cell are of low 
abundance (see table 1), this is a major problem. Consequently, RT-PCR may be the 
method of choice for confirming differential expression. Although the procedure is 
somewhat more complex than Northern analysis, requiring synthesis of primers and 
optimization of reaction conditions for each gene species, it is now possible to set up 
high throughput PCR systems using multichannel pipettes, 96+-well plates and 
appropriate thermal cycling technology. Whilst quantitative analysis is more 
desirable, being more accurate and without reliance on an internal standard, the 
money and time needed to develop a competitor molecule is often excessive, 
especially when one might be examining tens or even hundreds of gene species. The 
use of semi-quantitative analysis is simpler, although still relatively involved. One 
must first of all choose an internal standard that does not change in the test cells 
compared to the controls. Numerous reference genes have been tried in the past, for 
example interferon-gamma (IFN-γ, Frye et al. 1989), β-actin (Heuval et al. 1994), 
glyceraldehyde-3-phosphate dehydrogenase (GAPDH, Wong et al. 1994), 
dihydrofolate reductase (DHFR, Mohler and Butler 1991), β-2-microglobulin (β-2-m, 
Murphy et al. 1990), hypoxanthine phosphoribosyl transferase (HPRT, Foss et 
al. 1998) and a number of others (ClonTechniques 1997b). Ideally, an internal 
standard should not change its level of expression in the cell regardless of cell age, 
stage in the cell cycle or through the effects of external stimuli. However, it has been 
shown on numerous occasions that the levels of most housekeeping genes currently 
used by the research community do in fact change under certain conditions and in 
different tissues (ClonTechniques 1997b). It is imperative, therefore, that pre- 
liminary experiments be carried out on a panel of housekeeping genes to establish 
their suitability for use in the model system. 
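
One way to carry out such a preliminary check is to measure each candidate housekeeping gene across all control and treated samples and rank the panel by how little each gene varies. The sketch below uses the coefficient of variation as the stability measure. That choice, the function names and the example numbers are assumptions for illustration, not a method taken from the cited sources.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean; lower means more stable."""
    centre = mean(values)
    if centre == 0:
        raise ValueError("mean signal is zero; cannot assess stability")
    return pstdev(values) / centre


def rank_reference_genes(panel):
    """Return gene names ordered from most to least stable across samples."""
    return sorted(panel, key=lambda gene: coefficient_of_variation(panel[gene]))


def normalise(target, reference):
    """Express target gene signal relative to the chosen internal standard."""
    return [t / r for t, r in zip(target, reference)]


if __name__ == "__main__":
    # Hypothetical band intensities for control and treated samples.
    panel = {
        "GAPDH": [100.0, 102.0, 98.0, 101.0],
        "beta-actin": [100.0, 150.0, 60.0, 120.0],
    }
    best = rank_reference_genes(panel)[0]
    print(best, normalise([10.0, 20.0, 9.0, 11.0], panel[best]))
```

A gene that ranks well in one model system may still shift in another tissue or treatment, so the ranking has to be repeated for each new experimental design.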

Interpretation of quantitative data must also be treated with caution. By 
comparing the lists of genes identified by differential expression one can perhaps 
gain insight into why two different species react in different ways to external stimuli. 
For example, rats and mice appear sensitive to the non-genotoxic effects of a wide 
range of peroxisome proliferators whilst Syrian hamsters and guinea pigs are largely 
resistant (Orton et al. 1984, Rodricks and Turnbull 1987, Lake et al. 1989, 1993, 
Makowska et al. 1992). A simplified approach to resolving the reason(s) why is to 
compare lists of up- and down-regulated genes in order to identify those which are 
expressed in only one species and, through background knowledge of the effects of 
the said gene, might suggest a mechanism of facilitated non-genotoxic carcinogenesis 
or protection. Of course, the situation is likely to be far more complex. Perhaps if 
there were one key gene protecting guinea pig from non-genotoxic effects and it was 
upregulated 50 times by PPs, the same gene might only be up-regulated five times 
in the rat. However, since both were noted to be upregulated, the importance of the 
gene may be overlooked. Just to complicate matters, a large change in expression 
does not necessarily mean a biologically important change. For example, what is the 
true relevance of gene Y which shows a 50-fold increase after a particular treatment, 
and gene Z which shows only a 5-fold increase? If one examines the literature one 
may find that historically, gene Y has often been shown to be up-regulated 40-60- 
fold by a number of unrelated stimuli— in light of this the 50-fold increase would 
appear less significant. However, the literature may show that gene Z has never been 
recorded as having more than doubled in expression — which makes your 5-fold 
increase all the more exciting. Perhaps even more interesting is if that same 5-fold 
increase has only been seen in related neoplasms or following treatment with related 
chemicals. 
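
The two comparisons discussed above (species-specific gene lists, and a fold change judged against its own history) can be sketched as follows. The function names, labels and example values are hypothetical, and the historical fold changes would have to come from the literature for the gene in question.

```python
def species_specific(up_in_a, up_in_b):
    """Split two lists of up-regulated genes into A-only, B-only and shared sets."""
    a, b = set(up_in_a), set(up_in_b)
    return a - b, b - a, a & b


def fold_in_context(observed_fold, historical_folds):
    """Compare an observed fold change with values previously reported for that gene."""
    if not historical_folds:
        return "no precedent"
    if observed_fold > max(historical_folds):
        return "exceeds all previous reports"
    if observed_fold < min(historical_folds):
        return "below all previous reports"
    return "within previously reported range"


if __name__ == "__main__":
    # Gene Y: 50-fold now, 40-60-fold with unrelated stimuli in the past.
    print("Y:", fold_in_context(50.0, [40.0, 45.0, 60.0]))
    # Gene Z: 5-fold now, never more than doubled before.
    print("Z:", fold_in_context(5.0, [1.5, 2.0]))
```

Note that the set comparison reproduces the weakness pointed out in the text: a gene up-regulated 50 times in one species and five times in the other lands in the shared set and is easily overlooked unless the magnitudes are compared as well.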

Problems in using the differential display approach 

Differential display technology originally held promise of an easily obtainable 
'fingerprint' of those genes which are up- or down-regulated in test animals/cells in 
a developmental process or following exposure to given stimuli. However, it has 
become clear that the fingerprinting process, whilst still valid, is much too complex 
to be represented by a single technique profile. This is because all differential display 
techniques have common and/or unique technical problems which preclude the 
isolation and identification of all those genes which show changes in expression. 
Furthermore, there are important genetic changes related to disease development 
which differential expression analysis is simply not designed to address. An example 
of this is the presence of small deletions, insertions, or point mutations such as those 
seen in activated oncogenes, tumour suppressor genes and individual poly- 
morphisms. Polymorphic variations, small though they usually are, are often 
regarded as being of paramount importance in explaining why some patients 
respond better than others to certain drug treatments (and, in logical extension, why 
some people are less affected by potentially dangerous xenobiotics/carcinogens than 
others). The identification of such point mutations and naturally occurring 
polymorphisms requires the subsequent application of sequencing, SSCP, DGGE 
or TGGE to the gene of interest. Furthermore, differential display is not designed 
to address issues such as alternatively spliced gene species or whether an increased 
abundance of mRNA is a result of increased transcription or increased mRNA 
stability. 



Conclusions 

Perhaps the main advantage of open system differential display techniques is that 
they are not limited by extant theories or researcher bias in revealing genes which are 
differentially expressed, since they are designed to amplify all genes which 
demonstrate altered expression. This means that they are useful for the isolation of 
previously unknown genes which may turn out to be useful biomarkers of a particular 
state or condition. At least one open system (SAGE) is also quantitative, thus 
eliminating the need to return to the original mRNA and carry out Northern/PCR 
analysis to confirm the result. However, the rapid progress of genome mapping 
projects means that over the next 5-10 years or so, the balance of experimental use 
will switch from open to closed differential display systems, particularly DNA 
arrays. Arrays are easier and faster to prepare and use, provide quantitative data, are 
suitable for high throughput analysis and can be tailored to look at specific signalling 
pathways or families of genes. Identification of all the gene sequences in human and 
common laboratory animals, combined with improved DNA array technology, 
means that it will soon no longer be necessary to try to isolate differentially expressed 
genes using the technically more demanding open system approach. Thus, their 
main advantage (that of identifying unknown genes) will be largely eradicated. It is 
likely, therefore, that their sphere of application will be reduced to analysis of the 
less common laboratory species, since it will be some time yet before the genomes of 
such animals as zebrafish, electric eels, gerbils, crayfish and squid, for example, will 
be sequenced. 

Of course, in the end the question will always remain: What is the functional/ 
biological significance of the identified, differentially expressed genes? One 
persistent problem is understanding whether differentially expressed genes are a 
cause or consequence of the altered state. Furthermore, many chemicals, such as 
non-genotoxic carcinogens, are also mitogens and so genes associated with 
replication will also be upregulated but may have little or nothing to do with the 
carcinogenic effect. Whilst differential display technology cannot hope to answer 
these questions, it does provide a springboard from which identification, regulatory 
and functional studies can be launched. Understanding the molecular mechanism of 
cellular responses is almost impossible without knowing the regulation and function 
of those genes and their condition (e.g. mutated). In an abstract sense, differential 
display can be likened to a still photograph, showing details of a fixed moment in 
time. Consider the Historian who knows the outcome of a battle and the placement 
and condition of the troops before the battle commenced, but is asked to try and 
deduce how the battle progressed and why it ended as it did from a few still 
photographs — an impossible task. In order to understand the battle, the Historian 
must find out the capabilities and motivation of the soldiers and their commanding 
officers, what the orders were and whether they were obeyed. He must examine the 
terrain, the remains of the battle and consider the effects the prevailing weather 
conditions exerted. Likewise, if mechanistic answers are to be forthcoming, the 
scientist must use differential display in combination with other techniques, such as 
knockout technology, the analysis of cell signalling pathways, mutation analysis and 
time and dose response analyses. Although this review has emphasized the 
importance of differential gene profiling, it should not be considered in isolation and 
the full impact of this approach will be strengthened if used in combination with 
functional genomics and proteomics (2-dimensional protein gels from isoelectric 
focusing and subsequent SDS electrophoresis and virtual 2D-maps using capillary 
electrophoresis). Proteomics is attracting much recent attention as many of the 
changes resulting in differential gene expression do not involve changes in mRNA 
levels, as described extensively herein, but rather protein-protein, protein-DNA and 
protein phosphorylation events which would require functional genomics or 
proteomic technologies for investigation. 

Despite the limitations of differential display technology, it is clear that many 
potential applications and benefits can be obtained from characterizing the genetic 
changes that occur in a cell during normal and disease development and in response 
to chemical or biological insult. In light of functional data, such profiling will 
provide a 'fingerprint' of each stage of development or response, and in the long 
term should help in the elucidation of specific and sensitive biomarkers for different 
types of chemical/biological exposure and disease states. The potential medical and 
therapeutic benefits of understanding such molecular changes are almost 
immeasurable. Amongst other things, such fingerprints could indicate the family or 
even specific type of chemical an individual has been exposed to plus the length 
and/or acuteness of that exposure, thus indicating the most prudent treatment. 
They may also help uncover differences in histologically identical cancers, provide 
diagnostic tests for the earliest stages of neoplasia and, again, perhaps indicate the 
most efficacious treatment. 

The Human Genome Project will be completed early in the next century and the 
DNA sequence of all the human genes will be known. The continuing development 
and evolution of differential gene expression technology will ensure that this 
knowledge contributes fully to the understanding of human disease processes. 

Acknowledgements 

We acknowledge Drs Nick Plant (University of Surrey), Sally Darney and Chris 
Luft (US EPA at RTP) for their critical analysis of the manuscript prior to 
submission. This manuscript has been reviewed in accordance with the policy of the 
US Environmental Protection Agency and approved for publication. Approval does 
not signify that the contents reflect the views and policies of the Agency, nor does 
mention of trade names constitute endorsement or recommendation for use. 

References 

Adams, M. D., Kelley, J. M., Gocayne, J. D., Dubnick, M., Polymeropoulos, M. H., Xiao, H., 
Merril, C. R., Wu, A., Olde, B., Moreno, R. F., Kerlavage, A. R., McCombie, W. R. and 
Venter, J. C., 1991, Complementary DNA sequencing: expressed sequence tags and human 
genome project. Science, 252, 1651-1656. 

An, G., Luo, G., Veltri, R. W. and O'Hara, S. M., 1996, Sensitive non-radioactive differential display 
method using chemiluminescent detection. Biotechniques, 20, 342-346. 

Axel, R., Feigelson, P. and Schultz, G., 1976, Analysis of the complexity and diversity of mRNA from 
chicken liver and oviduct. Cell, 7, 247-254. 

Band, V. and Sager, R., 1989, Distinctive traits of normal and tumor-derived human mammary 
epithelial cells expressed in a medium that supports long-term growth of both cell types. 
Proceedings of the National Academy of Sciences, USA, 86, 1249-1253. 

Bauer, D., Muller, H., Reich, J., Riedel, H., Ahrenkiel, V., Warthoe, P. and Strauss, M., 1993, 
Identification of differentially expressed mRNA species by an improved display technique 
(DDRT-PCR). Nucleic Acids Research, 21, 4272-4280. 

Bertioli, D. J., Schlichter, U. H. A., Adams, M. J., Burrows, P. R., Steinbiss, H.-H. and Antoniw, 
J. F., 1995, An analysis of differential display shows a strong bias towards high copy number 
mRNAs. Nucleic Acids Research, 23, 4520-4523. 

Bravo, R., 1990, Genes induced during the G0/G1 transition in mouse fibroblasts. Seminars in Cancer 
Biology, 1, 37-46. 

Burn, T. C., Petrovick, M. S., Hohaus, S., Rollins, B. J. and Tenen, D. G., 1994, Monocyte 
chemoattractant protein-1 gene is expressed in activated neutrophils and retinoic acid-induced 
human myeloid cell lines. Blood, 84, 2776-2783. 

Cao, J., Cai, X., Zheng, L., Geng, L., Shi, Z., Pao, C. C. and Zheng, S., 1997, Characterisation of 
colorectal cancer-related cDNA clones obtained by subtractive hybridisation screening. Journal of 
Cancer Research and Clinical Oncology, 123, 447-451. 

Cassidy, S. B., 1995, Uniparental disomy and genomic imprinting as causes of human genetic disease. 
Environmental and Molecular Mutagenesis, 25 (Suppl 26), 13-20. 

Chang, G. W. and Terzaghi-Howe, M., 1998, Multiple changes in gene expression are associated with 
normal cell-induced modulation of the neoplastic phenotype. Cancer Research, 58, 4445-4452. 

Chen, J., Schwartz, D. A., Young, T. A., Norris, J. S. and Yager, J. D., 1996, Identification of genes 
whose expression is altered during mitosuppression in livers of ethinyl estradiol-treated female 
rats. Carcinogenesis, 17, 2783-2786. 

Chen, J. J. W. and Peck, K., 1996, Non-radioactive differential display method to directly visualise and 
amplify differential bands on nylon membrane. Nucleic Acids Research, 24, 793-794. 

ClonTechniques, 1997a, PCR-Select Differential Screening Kit: the next step after Clontech 
PCR-Select cDNA subtraction. ClonTechniques, XII, 18-19. 

ClonTechniques, 1997b, Housekeeping RT-PCR amplimers and cDNA probes. ClonTechniques, XII, 
15-16. 

Davis, M. M., Cohen, D. I., Nielsen, E. A., Steinmetz, M., Paul, W. E. and Hood, L., 1984, 
Cell-type-specific cDNA probes and the murine I region: the localization and orientation of Ad alpha. 
Proceedings of the National Academy of Sciences (USA), 81, 2194-2198. 

Dellavalle, R. P., Peterson, R. and Lindquist, S., 1994, Preferential deadenylation of HSP70 mRNA 
plays a key role in regulating Hsp70 expression in Drosophila melanogaster. Molecular and Cell 
Biology, 14, 3646-3659. 

DeRisi, J. L., Vashwanath, R. L. and Brown, P., 1997, Exploring the metabolic and genetic control of 
gene expression on a genomic scale. Science, 278, 680-686. 

Diatchenko, L., Lau, Y.-F. C., Campbell, A. P., Chenchik, A., Moqadam, F., Huang, B., Lukyanov, 
K., Gurskaya, N., Sverdlov, E. D. and Siebert, P. D., 1996, Suppression subtractive 
hybridisation: A method for generating differentially regulated or tissue-specific cDNA probes 
and libraries. Proceedings of the National Academy of Sciences (USA), 93, 6025-6030. 

Dogra, S. C., Whitelaw, M. L. and May, B. K., 1998, Transcriptional activation of cytochrome P450 
genes by different classes of chemical inducers. Clinical and Experimental Pharmacology and 
Physiology, 25, 1-9. 

Duguid, J. R. and Dinauer, M. C., 1990, Library subtraction of in vitro cDNA libraries to identify 
differentially expressed genes in scrapie infection. Nucleic Acids Research, 18, 2789-2792. 

Dunbar, P. R., Ogg, G. S., Chen, J., Rust, N., van der Bruggen, P. and Cerundolo, V., 1998, Direct 
isolation, phenotyping and cloning of low-frequency antigen-specific cytotoxic T lymphocytes 
from peripheral blood. Current Biology, 26, 413-416. 






Fitzpatrick, D. R., Germain-Lee, E. and Valle, D., 1995, Isolation and characterisation of rat and 
human cDNAs encoding a novel putative peroxisomal enoyl-CoA hydratase. Genomics, 27, 
457-466. 

Foss, D. L., Baarsch, M. J. and Murtaugh, M. P., 1998, Regulation of hypoxanthine 
phosphoribosyltransferase, glyceraldehyde-3-phosphate dehydrogenase and beta-actin mRNA expression 
in porcine immune cells and tissues. Animal Biotechnology, 9, 67-78. 

Frye, R. A., Benz, C. C. and Liu, E., 1989, Detection of amplified oncogenes by differential polymerase 
chain reaction. Oncogene, 4, 1153-1157. 

Geisinger, A., Rodriguez, R., Romero, V. and Wettstein, R., 1997, A simple method for screening 
cDNAs arising from the cloning of RNA differential display bands. Elsevier Trends Journals 
Technical Tips Online, http://tto.trends.com, document T01110. 

Gress, T. M., Hoheisel, J. D., Lennon, G. G., Zehetner, G. and Lehrach, H., 1992, Hybridisation 
fingerprinting of high density cDNA filter arrays with cDNA pools derived from whole tissues. 
Mammalian Genome, 3, 609-619. 

Griffin, G. and Krishna, S., 1998, Cytokines in infectious diseases. Journal of the Royal College of 
Physicians, London, 32, 195-198. 

Groenink, M. and Leegwater, A. C. J., 1996, Isolation of delayed early genes associated with liver 
regeneration using Clontech PCR-select subtraction technique. ClonTechniques, XI, 23-24. 

Guimaraes, M. J., Bazan, J. F., Zlotnik, A., Wiles, M. V., Grimaldi, J. C., Lee, F. and 
McClanahan, T., 1995b, A new approach to the study of haematopoietic development in the yolk 
sac and embryoid bodies. Development, 121, 3335-3346. 

Guimaraes, M. J., Lee, F., Zlotnik, A. and McClanahan, T., 1995a, Differential display by 
PCR: novel findings and applications. Nucleic Acids Research, 23, 1832-1833. 

Gurskaya, N. G., Diatchenko, L., Chenchik, P. D., Siebert, P. D., Khaspekov, G. L., Lukyanov, 
K. A., Vagner, L. L., Ermolaeva, O. D., Lukyanov, S. A. and Sverdlov, E. D., 1996, 
Equalising cDNA subtraction based on selective suppression of polymerase chain reaction: 
Cloning of Jurkat cell transcripts induced by phytohemaglutinin and phorbol 12-Myrystate 
13-Acetate. Analytical Biochemistry, 240, 90-97. 

Hampson, I. N. and Hampson, L., 1997, CCLS and DROP: subtractive cloning made easy. Life Science 
News (A publication of Amersham Life Science), 23, 22-24. 

Hampson, I. N., Hampson, L. and Dexter, T. M., 1996, Directional random oligonucleotide primed 
(DROP) global amplification of cDNA: its application to subtractive cDNA cloning. Nucleic 
Acids Research, 24, 4832-4835. 

Hampson, I. N., Pope, L., Cowling, G. J. and Dexter, T. M., 1992, Chemical cross linking subtraction 
(CCLS): a new method for the generation of subtractive hybridisation probes. Nucleic Acids 
Research, 20, 2899. 

Hara, E., Kato, T., Nakada, S., Sekiya, S. and Oda, K., 1991, Subtractive cDNA cloning using 
oligo(dT)30-latex and PCR: isolation of cDNA clones specific to undifferentiated human 
embryonal carcinoma cells. Nucleic Acids Research, 19, 7097-7104. 

Hatada, I., Hayashizake, Y., Hirotsune, S., Komatsubara, H. and Mukai, T., 1991, A genomic 
scanning method for higher organisms using restriction sites as landmarks. Proceedings of the 
National Academy of Sciences (USA), 88, 9523-9527. 

Hecht, N., 1998, Molecular mechanisms of male sperm cell differentiation. BioEssays, 20, 555-561. 

Hedrick, S., Cohen, D. I., Nielsen, E. A. and Davis, M. E., 1984, Isolation of T cell-specific 
membrane-associated proteins. Nature, 308, 149-153. 

Hertz, R., Seckbach, M., Zakin, M. M. and Bar-Tana, J., 1996, Transcriptional suppression of the 
transferrin gene by hypolipidemic peroxisome proliferators. Journal of Biological Chemistry, 271, 
218-224. 

Heuval, J. P. V., Clark, G. C., Kohn, M. C., Tritscher, A. M., Greenlee, W. F., Lucier, G. W. and 
Bell, D. A., 1994, Dioxin-responsive genes: Examination of dose-response relationships using 
quantitative reverse transcriptase-polymerase chain reaction. Cancer Research, 54, 62-68. 

Hillier, L. D., Lennon, G., Becker, M., Bonaldo, M. F., Chiapelli, B., Chissoe, S., Dietrich, N., 
DuBuque, T., Favello, A., Gish, W., Hawkins, M., Hultman, M., Kucaba, T., Lacy, M., Le, 
M., Le, N., Mardis, E., Moore, B., Morris, M., Parsons, J., Prange, C., Rifkin, L., Rohlfing, 
T., Schellenberg, K., Soares, M. B., Tan, F., Thierry-Mieg, J., Trevaskis, E., Underwood, 
K., Wohldman, P., Waterston, R., Wilson, R. and Marra, M., 1996, Generation and analysis 
of 280,000 human expressed sequence tags. Genome Research, 6, 807-828. 

Hubank, M. and Schatz, D. G., 1994, Identifying differences in mRNA expression by representational 
difference analysis. Nucleic Acids Research, 22, 5640-5648. 

Hunter, T., 1991, Cooperation between oncogenes. Cell, 64, 249-270. 

Ivanova, N. B. and Belyavsky, A. V., 1995, Identification of differentially expressed genes by restriction 
endonuclease-based gene expression fingerprinting. Nucleic Acids Research, 23, 2954-2958. 

James, B. D. and Higgins, S. J., 1985, Nucleic Acid Hybridisation (Oxford: IRL Press Ltd). 

Kas-Deelen, A. M., Harmsen, M. C., de Maar, E. F. and van Son, W. J., 1998, A sensitive method for 
quantifying cytomegalic endothelial cells in peripheral blood from cytomegalovirus-infected 
patients. Clinical Diagnostic and Laboratory Immunology, 5, 622-626. 

Kilty, I. and Vickers, P., 1997, Fractionating DNA fragments generated by differential display PCR. 
Strategies Newsletter (Stratagene), 10, 50-51. 

Kleinjan, D.-J. and van Heyningen, V., 1998, Position effect in human genetic disease. Human 
Molecular Genetics, 7, 1611-1618. 

Ko, M. S., 1990, An 'equalized cDNA library' by the reassociation of short double-stranded cDNAs. 
Nucleic Acids Research, 18, 5705-5711. 

Lake, B. G., Evans, J. G., Cunninghame, M. E. and Price, R. J., 1993, Comparison of the hepatic 
effects of Wy-14,643 on peroxisome proliferation and cell replication in the rat and Syrian 
hamster. Environmental Health Perspectives, 101, 241-248. 

Lake, B. G., Evans, J. G., Gray, T. J. B., Korosi, S. A. and North, C. J., 1989, Comparative studies 
of nafenopin-induced hepatic peroxisome proliferation in the rat, Syrian hamster, guinea pig and 
marmoset. Toxicology and Applied Pharmacology, 99, 148-160. 

Lennard, M. S., 1993, Genetically determined adverse drug reactions involving metabolism. Drug 
Safety, 9, 60-77. 

Levy, S., Todd, S. C. and Maecker, H. T., 1998, CD81(TAPA-1): a molecule involved in signal 
transduction and cell adhesion in the immune system. Annual Review of Immunology, 16, 89-109. 

Liang, P. and Pardee, A. B., 1992, Differential display of eukaryotic messenger RNA by means of the 
polymerase chain reaction. Science, 257, 967-971. 

Liang, P., Averboukh, L., Keyomarsi, K., Sager, R. and Pardee, A., 1992, Differential display and 
cloning of messenger RNAs from human breast cancer versus mammary epithelial cells. Cancer 
Research, 52, 6966-6968. 

Liang, P., Averboukh, L. and Pardee, A. B., 1993, Distribution and cloning of eukaryotic mRNAs by 
means of differential display refinements and optimisation. Nucleic Acids Research, 21, 3269-3275. 

Liang, P., Bauer, D., Averboukh, L., Warthoe, P., Rohrwild, M., Muller, H., Strauss, M. and 
Pardee, A. B., 1995, Analysis of altered gene expression by differential display. Methods in 
Enzymology, 254, 304-321. 

Linskens, M. H., Feng, J., Andrews, W. H., Enlow, B. E., Saati, S. M., Tonkin, L. A., Funk, 
W. D. and Villeponteau, B., 1995, Cataloging altered gene expression in young and senescent 
cells using enhanced differential display. Nucleic Acids Research, 23, 3244-3251. 

Lisitsyn, N., Lisitsyn, N. and Wigler, M., 1993, Cloning the differences between two complex 
genomes. Science, 259, 946-951. 

Lohmann, J., Schickle, H. and Bosch, T. C. G., 1995, REN Display, a rapid and efficient method for 
non-radioactive differential display and mRNA isolation. Biotechniques, 18, 200-202. 

Lunney, J. K., 1998, Cytokines orchestrating the immune response. Reviews in Science and Technology, 
17, 84-94. 

Makowska, J. M., Gibson, G. G. and Bonner, F. W., 1992, Species differences in ciprofibrate 
induction of hepatic cytochrome P4504A1 and peroxisome proliferation. Journal of Biochemical 
Toxicology, 1, 183-191. 

Maldarelli , F., Xiang , C, Chamoun , G. and Zeichner , S. L., 1998, The expression of the essential . 
nuclear splicing factor SC35 is altered by human immunodeficiency virus infection. Virus 
Research, 53, 39-51. 

Mathieu -Daude, F., Cheng , R., Welsh , J. and McClelland , M., 1996, Screening of differentially 
amplified cDNA products from RNA arbitrarily primed PCR fingerprints using single strand 
conformation polymorphism (SSCP) gels. Nucleic Acids Research, 24, 1504-1507. 

McKenzie, D. and Drake, D., 1997, Identification of differentially expressed gene products with the 
castaway system. Strategies Newsletter (Stratagene), 10,19-20. 

McClelland, M., Mathieu -Daude, F. and Welsh, J., 1996, RNA fingerprinting and differential 
display using arbitrarily primed PCR. Trends in Genetics, 11, 242-246. 

Mechler , B. and Rabbitts , T. H., 1981, Membrane-bound ribosomes of myeloma cells. IV. mRNA 
complexity of free and membrane-bound polysomes. Journal of Cell Biology, 88, 29-36. 

Meyer, U. A. and Zanger, U. M., 1997, Molecular mechanisms of genetic polymorphisms of drug 
metabolism. Annual Review of Pharmacology and Toxicology, 37, 269-296. 

Mohler , K. M, and Butler , L. D., 1991, Quantitation of cytokine mRNA levels utilizing the reverse 
transcriptase-polymerase chain reaction following primary antigen-specific sensitization in 
vivo — I. Verification of linearity, reproducibility and specificity. Molecular Immunology, 28, 
437-447. 

Murphy, L. D., Herzog, C. E., Rudick, J. B., Tito Fojo, A. and Bates, S. E., 1990, Use of the 
polymerase chain reaction in the quantitation of the mdr-1 gene expression. Biochemistry, 29,
10351-10356. 

Nelson, D. R., Koymans, L., Kamataki, T., Stegeman, J. J., Feyereisen, R., Waxman, D. J.,
Waterman, M. R., Gotoh, O., Coon, M. J., Estabrook, R. W., Gunsalus, I. C. and Nebert,
D. W., 1996, Update on new sequences, gene mapping, accession numbers and nomenclature.
Pharmacogenetics, 6, 1-42.






Nishio, Y., Aiello, L. P. and King, G. L., 1994, Glucose induced genes in bovine aortic smooth muscle
cells identified by mRNA differential display. FASEB Journal, 8, 103-106.

O'Neill, M. J. and Sinclair, A. H., 1997, Isolation of rare transcripts by representational difference
analysis. Nucleic Acids Research, 25, 2681-2682.

Orton, T. C., Adam, H. K., Bentley, M., Holloway, B. and Tucker, M. J., 1984, Clobuzarit: species
differences in the morphological and biochemical response of the liver following chronic
administration. Toxicology and Applied Pharmacology, 73, 138-151.

Pelkonen, O., Maenpaa, J., Taavitsainen, P., Rautio, A. and Raunio, H., 1998, Inhibition and
induction of human cytochrome P450 (CYP) enzymes. Xenobiotica, 28, 1203-1253.

Philips, S. M., Bendall, A. J. and Ramshaw, I. A., 1990, Isolation of genes associated with high
metastatic potential in rat mammary adenocarcinomas. Journal of the National Cancer Institute,
82, 199-203.

Prashar, Y. and Weissman, S. M., 1996, Analysis of differential gene expression by display of 3' end
restriction fragments of cDNAs. Proceedings of the National Academy of Sciences (USA), 93,
659-663.

Ragno, S., Estrada, I., Butler, R. and Colston, M. J., 1997, Regulation of macrophage gene
expression following invasion by Mycobacterium tuberculosis. Immunology Letters, 57, 143-146.

Ramana, K. V. and Kohli , K. K., 1998, Gene regulation of cytochrome P450 — an overview. Indian 

Journal of Experimental Biology, 36, 437-446. 
Richard, L., Velasco, P. and Detmar, M., 1998, A simple immunomagnetic protocol for the selective
isolation and long-term culture of human dermal microvascular endothelial cells. Experimental
Cell Research, 240, 1-6.

Rockett, J. C., Esdaile, D. J. and Gibson, G. G., 1997, Molecular profiling of non-genotoxic
hepatocarcinogenesis using differential display reverse transcription-polymerase chain reaction
(ddRT-PCR). European Journal of Drug Metabolism and Pharmacokinetics, 22, 329-333.

Rodricks, J. V. and Turnbull, D., 1987, Inter-species differences in peroxisomes and peroxisome
proliferation. Toxicology and Industrial Health, 3, 197-212.

Rogler, G., Hausmann, M., Vogl, D., Aschenbrenner, E., Andus, T., Falk, W., Andreesen, R.,
Scholmerich, J. and Gross, V., 1998, Isolation and phenotypic characterization of colonic
macrophages. Clinical and Experimental Immunology, 112, 205-215.

Rohn, W. M., Lee, Y. J. and Benveniste, E. N., 1996, Regulation of class II MHC expression. Critical
Reviews in Immunology, 16, 311-330.

Rudin, C. M. and Thompson, C. B., 1998, B-cell development and maturation. Seminars in Oncology,
25, 435-446.

Sakaguchi, N., Berger, C. N. and Melchers, F., 1986, Isolation of a cDNA copy of an RNA species
expressed in murine pre-B cells. EMBO Journal, 5, 2139-2147.

Sambrook, J., Fritsch, E. F. and Maniatis, T., 1989, Gel electrophoresis of DNA. In N. Ford, M.

Nolan and M. Fergusen (eds), Molecular Cloning — A laboratory manual, 2nd edition (New York: 

Cold Spring Harbour Laboratory Press), Volume 1, pp. 6-37. 
Sargent, T. D. and Dawid, I. B., 1983, Differential gene expression in the gastrula of Xenopus laevis.
Science, 222, 135-139.

Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P. O. and Davis, R. W., 1996, Parallel human
genome analysis: Microarray-based expression monitoring of 1000 genes. Proceedings of the
National Academy of Sciences (USA), 93, 10614-10619.

Schneider, C., King, R. M. and Philipson, L., 1988, Genes specifically expressed at growth arrest of
mammalian cells. Cell, 54, 787-793.

Schneider-Maunoury, S., Gilardi-Hebenstreit, P. and Charnay, P., 1998, How to build a vertebrate
hindbrain. Lessons from genetics. C R Academy of Science III, 321, 819-834.

Semenza, G. L., 1994, Transcriptional regulation of gene expression: mechanisms and pathophysiology.
Human Mutations, 3, 180-199.

Sewall, C. H., Bell, D. A., Clark, G. C., Tritscher, A. M., Tully, D. B., Vanden Heuvel, J. and
Lucier, G. W., 1995, Induced gene transcription: implications for biomarkers. Clinical
Chemistry, 41, 1829-1834.

Singh, N., Agrawal, S. and Rastogi, A. K., 1997, Infectious diseases and immunity: special reference
to major histocompatibility complex. Emerging Infectious Diseases, 3, 41-49.

Smith, N. R., Li, A., Aldersley, M., High, A. S., Markham, A. F. and Robinson, P. A., 1997, Rapid
determination of the complexity of cDNA bands extracted from DDRT-PCR polyacrylamide
gels. Nucleic Acids Research, 25, 3552-3554.

Sompayrac, L., Jane, S., Burn, T. C., Tenen, D. G. and Danna, K. J., 1995, Overcoming limitations
of the mRNA differential display technique. Nucleic Acids Research, 23, 4738-4739.

St John, T. P. and Davis, R. W., 1979, Isolation of galactose-inducible DNA sequences from
Saccharomyces cerevisiae by differential plaque filter hybridisation. Cell, 16, 443-452.

Sun, Y., Hegamyer, G. and Colburn, N. H., 1994, Molecular cloning of five messenger RNAs
differentially expressed in preneoplastic or neoplastic JB6 mouse epidermal cells: one is
homologous to human tissue inhibitor of metalloproteinases-3. Cancer Research, 54, 1139-1144.






Sung, Y. J. and Denman, R. B., 1997, Use of two reverse transcriptases eliminates false-positive results
in differential display. Biotechniques, 23, 462-464.

Sutton, G., White, O., Adams, M. and Kerlavage, A., 1995, TIGR Assembler: A new tool for
assembling large shotgun sequencing projects. Genome Science and Technology, 1, 9-19.

Suzuki, Y., Sekiya, T. and Hayashi, K., 1991, Allele-specific polymerase chain reaction: a method for
amplification and sequence determination of a single component among a mixture of sequence
variants. Analytical Biochemistry, 192, 82-84.

Syed, V., Gu, W. and Hecht, N. B., 1997, Sertoli cells in culture and mRNA differential display provide
a sensitive early warning assay system to detect changes induced by xenobiotics. Journal of
Andrology, 18, 264-273.

Uitterlinden, A. G., Slagboom, P., Knook, D. L. and Vijg, J., 1989, Two-dimensional DNA
fingerprinting of human individuals. Proceedings of the National Academy of Sciences (USA), 86,
2742-2746.

Ullman, K. S., Northrop, J. P., Verweij, C. L. and Crabtree, G. R., 1990, Transmission of signals
from the T lymphocyte antigen receptor to the genes responsible for cell proliferation and immune
function: the missing link. Annual Review of Immunology, 8, 421-452.

Vasmatzis, G., Essand, M., Brinkmann, U., Lee, B. and Pastan, I., 1998, Discovery of three genes
specifically expressed in human prostate by expressed sequence tag database analysis. Proceedings
of the National Academy of Sciences (USA), 95, 300-304.

Velculescu, V. E., Zhang, L., Vogelstein, B. and Kinzler, K. W., 1995, Serial analysis of gene
expression. Science, 270, 484-487.

Voeltz, G. K. and Steitz, J. A., 1998, AUUUA sequences direct mRNA deadenylation uncoupled from
decay during Xenopus early development. Molecular and Cell Biology, 18, 7537-7545.

Vogelstein, B. and Kinzler, K. W., 1993, The multistep nature of cancer. Trends in Genetics, 9,
138-141.

Walter, J., Belfield, M., Hampson, I. and Read, C., 1997, A novel approach for generating subtractive
probes for differential screening by CCLS. Life Science News, 21, 13-14.

Wan, J. S., Sharp, S. J., Poirer, G. M.-C., Wagaman, P. C., Chambers, J., Pyati, J., Hom, Y.-L.,
Galindo, J. E., Huvar, A., Peterson, P. A., Jackson, M. R. and Erlander, M. G., 1996,
Cloning differentially expressed mRNAs. Nature Biotechnology, 14, 1685-1691.

Wang, Z. and Brown, D. D., 1991, A gene expression screen. Proceedings of the National Academy of
Sciences (USA), 88, 11505-11509.

Wawer, C., Ruggeberg, H., Meyer, G. and Muyzer, G., 1995, A simple and rapid electrophoresis
method to detect sequence variation in PCR-amplified DNA fragments. Nucleic Acids Research,
23, 4928-4929.

Welsh, J., Chada, K., Dalal, S. S., Cheng, R., Ralph, D. and McClelland, M., 1992, Arbitrarily
primed PCR fingerprinting of RNA. Nucleic Acids Research, 20, 4965-4970.

Wong, H., Anderson, W. D., Cheng, T. and Riabowol, K. T., 1994, Monitoring mRNA expression
by polymerase chain reaction: the 'primer-dropping' method. Analytical Biochemistry, 223,
251-258.

Wong, K. K. and McClelland, M., 1994, Stress-inducible gene of Salmonella typhimurium identified
by arbitrarily primed PCR of RNA. Proceedings of the National Academy of Sciences (USA), 91,
639-643.

Wynford-Thomas, D., 1991, Oncogenes and anti-oncogenes; the molecular basis of tumour behaviour.
Journal of Pathology, 165, 187-201.

Xu, D., Chan, W. L., Leung, B. P., Huang, F. P., Wheeler, R., Piedrafita, D., Robinson, J. H. and
Liew, F. Y., 1998, Selective expression of a stable cell surface molecule on type 2 but not type 1
helper T cells. Journal of Experimental Medicine, 187, 787-794.

Yang, M. and Sytkowski, A. J., 1996, Cloning differentially expressed genes by linker capture
subtraction. Analytical Biochemistry, 237, 109-114.

Zhao, N., Hashida, H., Takahashi, N., Misumi, Y. and Sakaki, Y., 1995, High-density cDNA filter
analysis: a novel approach for large scale quantitative analysis of gene expression. Gene, 156,
207-213.

Zhao, X. J., Newsome, J. T. and Cihlar, R. L., 1998, Up-regulation of two Candida albicans genes in the
rat model of oral candidiasis detected by differential display. Microbial Pathogenesis, 25, 121-129.

Zimmermann, C. R., Orr, W. C., Leclerc, R. F., Barnard, C. and Timberlake, W. E., 1980,
Molecular cloning and selection of genes regulated in Aspergillus development. Cell, 21, 709-715.



Docket No.: PF-0621 USN
USSN: 09/830,914
Ref. No. 2 of 8

Proc. Natl. Acad. Sci. USA

Vol. 94, pp. 8945-8947, August 1997 
Applied Biological Sciences 



Whole genome analysis: Experimental access to all genome 
sequenced segments through larger-scale efficient 
oligonucleotide synthesis and PCR

DEVAL A. LASHKARI*†, JOHN H. McCUSKER‡, AND RONALD W. DAVIS*§

*Departments of Genetics and Biochemistry, Beckman Center, Stanford University, Stanford, CA 94305; and ‡Department of Microbiology, 3020 Duke University
Medical Center, Durham, NC 27710



Contributed by Ronald W. Davis, May 20, 1997 

ABSTRACT The recent ability to sequence whole genomes 
allows ready access to all genetic material. The approaches 
outlined here allow automated analysis of sequence for the 
synthesis of optimal primers in an automated multiplex 
oligonucleotide synthesizer (AMOS). The efficiency is such 
that all ORFs for an organism can be amplified by PCR. The 
resulting amplicons can be used directly in the construction of 
DNA arrays or can be cloned for a large variety of functional 
analyses. These tools allow a replacement of single-gene 
analysis with a highly efficient whole-genome analysis. 



The genome sequencing projects have generated and will 
continue to generate enormous amounts of sequence data. The 
genomes of Saccharomyces cerevisiae, Escherichia coli, Haemophilus
influenzae (1), Mycoplasma genitalium (2), and Methanococcus
jannaschii (3) have been completely sequenced.
Other model organisms have had substantial portions of their 
genomes sequenced as well, including the nematode Caeno- 
rhabditis elegans (4) and the small flowering plant Arabidopsis 
thaliana (5). This massive and increasing amount of sequence 
information allows the development of novel experimental 
approaches to identify gene function. 

One standard use of genome sequence data is to attempt to 
identify the functions of predicted open reading frames 
(ORFs) within the genome by comparison to genes of known 
function. Such a comparative analysis of all ORFs to existing 
sequence data is fast, simple, and requires no experimentation 
and is therefore a reasonable first step. While finding sequence 
homologies/motifs is not a substitute for experimentation, 
noting the presence of sequence homology and/or sequence 
motifs can be a useful first step in finding interesting genes, in 
designing experiments and, in some cases, predicting function. 
However, this type of analysis is frequently uninformative. For
example, over one-half of new ORFs in S. cerevisiae have no
known function (6). If this is the case in a well studied organism
such as yeast, the problem will be even worse in organisms that
are less well studied or less manipulable. A large, experimentally
determined gene function database would make homology/motif
searches much more useful.

Experimental analysis must be performed to thoroughly 
understand the biological function of a gene product. Scaling 
up from classical "cottage industry" one-gene-oriented ap- 
proaches to whole-genome analysis would be very expensive 
and laborious. It is clear that novel strategies are necessary to 
efficiently pursue the next phase of the genome projects — 
whole-genome experimental analysis to explore gene expres- 
sion, gene product function, and other genome functions. 
Model organisms, such as S. cerevisiae, will be extremely 
important in the development of novel whole-genome analysis
techniques and, subsequently, in improving our understanding 
of other more complex and less manipulable organisms. 

The genome sequence can be systematically used as a tool 
to understand ORFs, gene product function, and other ge- 
nome regions. Toward this end, a directed strategy has been 
developed for exploiting sequence information as a means of 
providing information about biological function (Fig. 1). Ef- 
forts have been directed toward the amplification of each 
predicted ORF or any other region of the genome ranging
from a few base pairs to several kilobase pairs. There are many 
uses for these amplicons — they can be cloned into standard 
vectors or specialized expression vectors, or can be cloned into 
other specialized vectors such as those used for two-hybrid 
analysis. The amplicons can also be used directly by, for 
example, arraying onto glass for expression analysis, for DNA 
binding assays, or for any direct DNA assay (7). As a pilot 
study, synthetic primers were made on the 96-well automated 
multiplex oligonucleotide synthesizer (AMOS) instrument (8) 
(Fig. 2). These oligonucleotides were used to amplify each 
ORF on yeast chromosome V. The current version of this 
instrument can synthesize three plates of 96 oligonucleotides 
each (25 bases) in an 8-hr day. The amplification of the entire 
set of PCR products was then analyzed by gel electrophoresis 
(Fig. 3). Successful amplification of the proper length product 
on the first attempt was 95%. This project demonstrates that 
one can go directly from sequence information to biological 
analysis in a truly automated, totally directed manner. 
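
The throughput figures quoted above allow a rough estimate of how long primer synthesis would take for a given ORF set. The following is a minimal sketch only: the function name `synthesis_days`, the assumption of exactly two primers per ORF, and the neglect of failed syntheses are illustrative choices and are not part of the published protocol. The plate capacity (three plates of 96 oligonucleotides per 8-hr day) is taken from the text.

```python
import math


def synthesis_days(n_orfs, primers_per_orf=2, plates_per_day=3, wells_per_plate=96):
    """Estimate the working days needed to synthesize primers for n_orfs ORFs.

    Assumes every well yields a usable oligonucleotide (no resynthesis) and
    that each ORF needs one forward and one reverse primer.
    """
    oligos_needed = n_orfs * primers_per_orf
    oligos_per_day = plates_per_day * wells_per_plate
    # Partial days are rounded up to whole working days.
    return math.ceil(oligos_needed / oligos_per_day)


if __name__ == "__main__":
    # Hypothetical inputs: a few hundred ORFs versus a genome-scale set.
    for n in (240, 6000):
        print(n, "ORFs:", synthesis_days(n), "working day(s)")
```

Under these assumptions a set of 240 ORFs fits in two working days of synthesis, whereas 6,000 ORFs would occupy a single instrument for about 42 working days.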

These amplicons can be incorporated directly in arrays or 
the amplicons can be cloned. If the amplicons are to be cloned, 
novel sequences can be incorporated at the 5' end of the 
oligonucleotide to facilitate cloning. One potential problem 
with cloning PCR products is that the cloned amplicons may 
contain sequence alterations that diminish their utility. One 
option would be to resequence each individual amplicon. 
However, this is expensive, inefficient, and time consuming. A 
faster, more cost-effective, and more accurate approach is to 
apply comparative sequencing by denaturing HPLC (9). This 
method is capable of detecting a single base change in a 2-kb 
heteroduplex. Longer amplicons can be analyzed by use of 
appropriate restriction fragments. If any change is detected in 
a clone, an alternate clone of the same region can be analyzed. 
Modifying the system to allow high throughput analysis by 
denaturing HPLC is also relatively simple and straightforward. 

If amplicons are used directly on arrays without cloning, it
is important to note that, even if single PCR product bands are
observed on gels, the PCR products will be contaminated with
various amounts of other sequences. This contamination has
the potential to affect the results in, for example, expression
analysis. On the other hand, direct use of the amplicons is
much less labor intensive and greatly decreases the occurrence
of mistakes in clone identification, a ubiquitous problem
associated with large clone set archiving and retrieving.

Fig. 1. Overview of systematic method for isolating individual
genes. Sequence information is obtained automatically from sequence
databases. The data are input into primer selection software specifically
designed to target ORFs as designated by database annotations.
The output file containing the primer information is directly read by
a high-throughput oligonucleotide synthesizer, which makes the
oligonucleotides in 96-well plates (AMOS, automated multiplex
oligonucleotide synthesizer). The forward and reverse primers are
synthesized in the same location on separate plates to facilitate the
downstream handling of primers. The amplicons are generated by PCR in
96-well plates as well.

Any large-scale effort to capture each ORF within a genome 
must rely on automation if cost is to be minimized while 
efficiency is maximized. Toward that end, primers targeting 
ORFs were designed automatically using simple new scripts 
and existing primer selection software. These script-selected 
primer sequences were directly read by the high-throughput 
synthesizer and the forward and reverse primers were synthe- 
sized in separate plates in corresponding wells to facilitate 
automated pipetting and PCR amplifications. Each of the 
resulting PCR products, generated with minimum labor, con- 
tains a known, unique ORF. 
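
The plate bookkeeping described in this paragraph can be illustrated with a short sketch. It is not the authors' script: the paper relied on existing primer selection software, whereas the hypothetical `naive_primer_pair` below simply takes the first 25 bases of the ORF and the reverse complement of the last 25, with no check of melting temperature or specificity. What the sketch does show is the layout rule from the text, namely that the forward and reverse primers for one ORF occupy the same well on separate plates.

```python
# Complement table for unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
ROWS = "ABCDEFGH"  # 8 rows x 12 columns = 96 wells


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def naive_primer_pair(orf_seq, length=25):
    """Pick fixed-length primers from the two ends of an ORF (illustrative only)."""
    seq = orf_seq.upper()
    if len(seq) < 2 * length:
        raise ValueError("ORF is too short for two non-overlapping primers")
    forward = seq[:length]
    reverse = reverse_complement(seq[-length:])
    return forward, reverse


def well_name(index):
    """Map a running index to a 96-well position, filling row by row (A1..H12)."""
    row, col = divmod(index % 96, 12)
    return f"{ROWS[row]}{col + 1}"


def plate_layout(orfs, length=25):
    """Assign each ORF one well; both primers share that well on separate plates."""
    layout = []
    for i, (name, seq) in enumerate(orfs.items()):
        forward, reverse = naive_primer_pair(seq, length)
        layout.append({
            "orf": name,
            "plate_pair": i // 96 + 1,  # forward plate n and reverse plate n
            "well": well_name(i),
            "forward": forward,
            "reverse": reverse,
        })
    return layout
```

Because the well index is shared, a multichannel pipettor can combine forward plate n and reverse plate n into one PCR plate without any per-well lookup.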

Large-scale genome analysis projects are dependent on 
newly emerging technologies to make the studies practical and 
economically feasible. For example, the cost of the primers, a 
significant issue in the past, has been reduced dramatically to 
make feasible this and other projects that require tens of 
thousands of oligonucleotides. Other methods of high- 
throughput analysis are also vital to the success of functional 
analysis projects, such as microarraying and oligonucleotide 
chip methods (10-14). 

Changes in attitude are also required. One of the major costs
of commercial oligonucleotides is extensive quality control
such that virtually 100% of the supplied oligonucleotides are
successfully synthesized and work for their intended purpose.
Considerable cost reduction can be obtained by simply decreasing
the expected successful synthesis rate to 95-97%. One
can then achieve faster and cheaper whole genome coverage by
simply adding a single quality control at the end of the
experiment and batching the failures for resynthesis.

Fig. 2. Overall approach for using database of a genome to direct
biological analysis. The synthesis of the 6,000 ORFs (orfs) for each
gene of S. cerevisiae can be used in many applications utilizing both
cloning and microarraying technology.
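
The batching argument can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every oligonucleotide succeeds independently with the same probability in each round and that all failures from one round are resynthesized together in the next; the function names and the example set size of 12,000 oligonucleotides are hypothetical.

```python
def expected_failures(n_oligos, success_rate, rounds):
    """Expected number of oligonucleotides still missing after a number of rounds."""
    return n_oligos * (1.0 - success_rate) ** rounds


def rounds_until_complete(n_oligos, success_rate, threshold=1.0):
    """Count synthesis rounds until the expected number of failures drops below threshold."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    remaining = float(n_oligos)
    rounds = 0
    while remaining >= threshold:
        remaining *= 1.0 - success_rate  # only the failures are resynthesized
        rounds += 1
    return rounds


if __name__ == "__main__":
    for rate in (0.95, 0.97):
        print(rate, rounds_until_complete(12000, rate))
```

Under these assumptions, a 95% success rate leaves about 600 of 12,000 oligonucleotides to repeat after the first pass, and the expected shortfall falls below one oligonucleotide after four rounds (three rounds at 97%).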

The directed nature of the amplicon approach is of clear 
advantage. The sequence of each ORF is analyzed automati- 
cally, and unique specific primers are made to target each 
ORF. Thus, there is relatively little time or labor involved — for 
example, no random cloning and subsequent screening is 
required because each product is known. In the test system, 
primers for 240 ORFs from chromosome V were systematically 
synthesized, beginning from the left arm and continuing 
through to the right arm. At no point was there any manual 
analysis of sequence information to generate the collection. In 
many ways, now that the sequence is known, there is no need 
for the researcher to examine it. 

These amplicons can be arrayed and expression analysis can 
be done on all arrayed ORFs with a single hybridization (10). 
Those ORFs that display significant differential expression 
patterns under a given selection are easily identified without 
the laborious task of searching for and then sequencing a clone. 
Once scaled up, the procedure provides even greater returns 
on effort, because a single hybridization will ultimately provide 
a "snapshot" of the expression of all genes in the yeast genome. 
Thus, the limiting factor in whole genome analysis will not be 
the analysis process itself, but will instead be the ability of 
researchers to design and carry out experimental selections. 

Current expression and genetic analysis technologies are 
geared toward the analysis of single genes and are ill suited to 
analyze numerous genes under many conditions. Additional 
difficulties with current technologies include: the effort and 
expense required to analyze expression and make mutants, the 
potential duplication of effort if done by different laboratories, 
and the possibility of conflicting results obtained from different
laboratories. In contrast, whole genome analysis not only
is more efficient, it also provides data of much higher quality;
all genes are assayed and compared in parallel under exactly
the same conditions. In addition, amplicons have many applications
beyond gene expression. For example, one recent
approach is to incorporate a unique DNA sequence tag,
synthesized as part of each gene specific primer, during
amplification. The tags or molecular bar codes, when reintroduced
into the organism as a gene deletion or as a gene clone,
can be used much more efficiently than individual mutations
or clones because pools of tagged mutants or transformants
can be analyzed in parallel. This parallel analysis is possible
because the tags are readily and quantitatively amplified even
in complex mixtures of tags (13).

Fig. 3. Gel image of amplifications. Using the method described in Fig. 1, amplicons were generated for ORFs of S. cerevisiae chromosome
V. One plate of 96 amplification reactions is shown.

These ORF genome arrays and oligonucleotide tagged 
libraries can be used for many applications. Any conventional 
selection applied to a library that gives discrete or multiple 
products can use these technologies for a simple direct read- 
out. These include screens and selections for mutant comple- 
mentation, overexpression suppression (15, 16), second-site 
suppressors, synthetic lethality, drug target overexpression 
(17), two-hybrid screens (18), genome mismatch scanning (19),
or recombination mapping. 

The genome projects have provided researchers with a vast 
amount of information. These data must be used efficiently 
and systematically to gain a truly comprehensive understand- 
ing of gene function and, more broadly, of the entire genome 
which can then be applied to other organisms. Such global 
approaches are essential if we are to gain an understanding of 
the living cell. This understanding should come from the 
viewpoint of the integration of complex regulatory networks, 
the individual roles and interactions of thousands of functional 
gene products, and the effect of environmental changes on 
both gene regulatory networks and the roles of all gene 
products. The time has come to switch from the analysis of a 
single gene to the analysis of the whole genome. 

Support was provided by National Institutes of Health Grants 
R37HG00198 and P01HG00205.
INTRODUCTION

Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cerevisiae [4].
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNAse protection assays, S1 nuclease
analysis, plaque hybridization, and slot blots
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies. 



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 

cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid sup- 
port. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate 
[13,14]. Sample detection for microarrays on glass
involves the use of probes labeled with fluores- 
cent or radioactive nucleotides. 

Fluorescent cDNA probes are generated from con- 
trol and test RNA samples in single-round reverse-tran- 
scription reactions in the presence of fluorescently 
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which 
produces control and test products labeled with dif- 
ferent fluors. The cDNAs generated from these two 
populations, collectively termed the "probe," are then 
mixed and hybridized to the array under a glass cov- 
erslip [10,11,15]. The fluorescent signal is detected 
by using a custom-designed scanning confocal mi- 
croscope equipped with a motorized stage and lasers 
forfluor excitation [10,11,15]. The data are analyzed 
.with custom digital image analysis software that de- 
termines for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human in- 
flammatory disease-related genes [20]. The most dra- 
matic result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cerevisiae [21].
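
The per-feature quantity described above, the ratio of fluor 1 to fluor 2 corrected for local background, can be sketched as follows. This is an illustrative calculation and not the custom image analysis software cited in the text; the function names, the simple subtraction of a local background value from each channel, and the `floor` parameter that guards against zero or negative corrected signals are all assumptions introduced here.

```python
import math


def background_corrected_ratio(fluor1, background1, fluor2, background2, floor=1.0):
    """Ratio of fluor 1 to fluor 2 for one array feature after background subtraction.

    Corrected signals are clamped to `floor` so that a feature at or below
    background cannot produce a zero or negative denominator.
    """
    signal1 = max(fluor1 - background1, floor)
    signal2 = max(fluor2 - background2, floor)
    return signal1 / signal2


def log2_ratio(fluor1, background1, fluor2, background2, floor=1.0):
    """Log2 of the corrected ratio; induction and repression become symmetric."""
    return math.log2(
        background_corrected_ratio(fluor1, background1, fluor2, background2, floor)
    )


if __name__ == "__main__":
    # Hypothetical intensities for one feature: test channel 1200, control 700,
    # with a local background of 200 in each channel.
    print(background_corrected_ratio(1200, 200, 700, 200))
    print(log2_ratio(1200, 200, 700, 200))
```

Because both populations are hybridized to the same feature, the ratio is insensitive to how much DNA was deposited in that spot, which is why hybridization between arrays need not be controlled for.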

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al- 
beit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two different
membranes are used simultaneously for control
and test RNA hybridizations, or a single
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27]. 

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the applica- 
tion of this method to the fields of medical diagnos- 



tics, pharmacogenetics, and sequencing by hybrid- 
ization as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high-intensity 
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to synthesize
a vast number of unique oligos, the total number
of which is limited only by the complexity of the
photolithographic mask and the chip size [29,31,32]. 
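
The 4n-cycle count follows from offering each of the four nucleotides once per position. The toy simulation below illustrates this under simplifying assumptions that are ours and not the cited authors': every coupling is perfectly efficient, all oligonucleotides on the chip have the same length, and a "mask" is represented as a list of booleans marking which features are illuminated (deprotected) in a given cycle.

```python
def synthesize(targets):
    """Simulate mask-directed in situ synthesis of equal-length oligonucleotides.

    For each position, the four bases are offered in turn; the mask for a cycle
    exposes only those features whose next base matches the base being coupled.
    Returns the synthesized sequences and the number of cycles used.
    """
    if not targets:
        return [], 0
    n = len(targets[0])
    if any(len(t) != n for t in targets):
        raise ValueError("all target oligonucleotides must have the same length")

    grown = ["" for _ in targets]
    cycles = 0
    for position in range(n):
        for base in "ACGT":
            # Mask: illuminate features that need `base` at this position.
            mask = [t[position] == base for t in targets]
            grown = [g + base if lit else g for g, lit in zip(grown, mask)]
            cycles += 1
    return grown, cycles


if __name__ == "__main__":
    oligos = ["ACGT", "TTTT", "GATC"]
    products, cycles = synthesize(oligos)
    print(products == oligos, cycles)  # cycle count is 4 * len("ACGT")
```

The cycle count depends only on the oligonucleotide length and not on how many features are synthesized, which is why the number of distinct sequences is bounded by mask complexity and chip area rather than by synthesis time.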

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin) 
after hybridization [12,33]. The signal is detected with 
a custom confocal scanner [34]. This method has 
been applied successfully to the mapping of genomic 
library clones [35], to de novo sequencing by hybrid- 
ization [28,36], and to evolutionary sequence com- 
parison of the BRCA1 gene [37]. In addition, 
mutations in the cystic fibrosis [38] and BRCA1 [39] 
gene products and polymorphisms in the human
immunodeficiency virus-1 clade B protease gene [40]
have been detected by this method. Oligonucleotide 
chips are also useful for expression monitoring [33] 
as has been demonstrated by the simultaneous evaluation
of gene-expression patterns in nearly all open
reading frames of the yeast S. cerevisiae [12].
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a
far more sensitive, characteristic, and measurable
endpoint than the toxicity itself. We therefore propose
that a method based on measurements of the
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Cells are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres- 
sion changes are assessed by hybridization to a cDNA 
microarray chip (Figure 1). We have developed a custom
DNA chip, called ToxChip v1.0, specifically for
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxicants,
termed a toxicant signature, is determined.

This signature is derived by ranking the gene-expression
data across all experiments based on relative
fold induction or suppression of genes in treated
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi- 
lar structural isomers in a combinatorial chemistry 
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 
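
The signature method described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes each experiment is already expressed as per-gene log2(treated/control) ratios, keeps genes that change in the same direction in every experiment, ranks them by their smallest absolute change, and scores an unknown profile by simple sign agreement rather than by clustering or principal-component analysis. The gene names, the `top_k` parameter, and the scoring rule are all hypothetical.

```python
from statistics import mean


def derive_signature(experiments, top_k=2):
    """Pick the genes that respond most consistently across experiments.

    experiments: list of dicts mapping gene -> log2(treated / control).
    Returns a dict mapping each selected gene to its mean log2 ratio.
    """
    consistent = {}
    for gene in experiments[0]:
        values = [exp[gene] for exp in experiments]
        # Keep only genes induced in every experiment or suppressed in every experiment.
        if all(v > 0 for v in values) or all(v < 0 for v in values):
            consistent[gene] = values
    # Rank by the weakest response seen, so one large outlier cannot carry a gene.
    ranked = sorted(
        consistent,
        key=lambda g: min(abs(v) for v in consistent[g]),
        reverse=True,
    )
    return {g: mean(consistent[g]) for g in ranked[:top_k]}


def match_score(profile, signature):
    """Fraction of signature genes that change in the same direction in the profile."""
    if not signature:
        return 0.0
    hits = sum(1 for g, v in signature.items() if profile.get(g, 0.0) * v > 0)
    return hits / len(signature)


# Hypothetical log2 ratios from two exposures to the same toxicant class.
oxidant_runs = [
    {"geneA": 2.1, "geneB": -1.5, "geneC": 0.2},
    {"geneA": 1.8, "geneB": -1.2, "geneC": -0.4},
]
signature = derive_signature(oxidant_runs)
unknown = {"geneA": 1.4, "geneB": -0.9, "geneC": 0.1}
print(sorted(signature))                # ['geneA', 'geneB']
print(match_score(unknown, signature))  # 1.0
```

A score near 1.0 against one class signature and near 0.5 against the others would suggest a putative mechanism of action; distinguishing closely related compounds would need the statistical methods named in the text.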

For the studies outlined in Figure 2, we developed 
the custom cDNA microarray chip ToxChip v1.0.

Figure 1. Simplified overview of the method for sample
preparation and hybridization to cDNA microarrays. For
illustrative purposes, samples derived from cell culture are
depicted, although other sample types are amenable to this
analysis.

Figure 2. Schematic representation of the method for
identification of a toxicant's mechanism of action. In this method,
gene-expression data derived from exposure of model
systems to known toxicants are analyzed, and a set of changes
characteristic to that type of toxicant (termed the toxicant
signature) is identified. As depicted, oxidant stressors produce
consistent changes in group A genes (indicated by red and
green circles), but not group B or C genes (indicated by gray
circles). The set of gene-expression changes elicited by the
suspected toxicant is then compared with these characteristic
patterns, and a putative mechanism of action is assigned to
the unknown agent.



The 2090 human genes that comprise this subarray 
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators, 
estrogenic compounds, and oxidant stress. Some of 
the other categories of genes include transcription 
factors, oncogenes, tumor suppressor genes, cyclins, 
kinases, phosphatases, cell adhesion and motility 
genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridiza- 
tion intensity is averaged and used for signal nor- 
malization of the other genes on the chip. To date, 
very few toxicants have been shown to have appreciable
effects on the expression of these housekeeping
genes. However, this housekeeping list will be
revised if new data warrant the addition or deletion
of a particular gene. Table 1 contains a general description
of some of the different classes of genes
that comprise ToxChip v1.0.
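
The housekeeping normalization described above amounts to dividing each spot intensity by the average intensity of the housekeeping spots. The sketch below is a minimal illustration under that assumption; the function name, the three gene symbols, and the intensity values are placeholders and are not taken from the ToxChip gene list.

```python
from statistics import mean


def normalize_to_housekeeping(intensities, housekeeping_genes):
    """Divide every spot intensity by the mean housekeeping intensity."""
    reference = mean(intensities[g] for g in housekeeping_genes)
    if reference <= 0:
        raise ValueError("housekeeping reference must be positive")
    return {gene: value / reference for gene, value in intensities.items()}


# Hypothetical raw intensities; the first two genes stand in for housekeeping spots.
raw = {"ACTB": 1000.0, "GAPDH": 3000.0, "CYP1A1": 500.0}
normalized = normalize_to_housekeeping(raw, ["ACTB", "GAPDH"])
print(normalized["CYP1A1"])  # 0.25
```

Because the reference is an average over many genes, a toxicant that shifts one housekeeping gene has a limited effect on the scale factor, which is presumably why the list is revised only when new data warrant it.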

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures
are displayed [11]. This facilitates rapid, visual
interpretation of data. We are also developing ToxChip
v2.0 and chips for other model systems,
including rat, mouse, Xenopus, and yeast, for use in 
toxicology studies. 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology test- 
ing. Unfortunately, these assays are inherently ex- 
pensive, require large numbers of animals and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are
still used for testing endpoints such as neurotoxic- 
ity, immunotoxicity, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown
agent is (or is not) responsible for eliciting a given
biological response. This information would help to
select a bioassay more specifically suited to the agent
in question or perhaps suggest that a bioassay is not
necessary, which would dramatically reduce cost,
animal use, and time.

Table 1. ToxChip v1.0: A Human cDNA Microarray
Chip Designed to Detect Responses to Toxic Insult

Gene category                            No. of genes on chip

Apoptosis                                  72
DNA replication and repair                 99
Oxidative stress/redox homeostasis         90
Peroxisome proliferator responsive         22
Dioxin/PAH responsive                      12
Estrogen responsive                        63
Housekeeping                               84
Oncogenes and tumor suppressor genes       76
Cell-cycle control                         51
Transcription factors                     131
Kinases                                   276
Phosphatases                               88
Heat-shock proteins                        23
Receptors                                 349
Cytochrome P450s                           30

*This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxicology
and plant pathology. Bioassays based on the
flathead minnow, Daphnia, and Arabidopsis could
also be improved by the addition of microarray analysis.
The combination of microarrays with traditional
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring, 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed 
to PAHs (and many other compounds) is under con- 
sideration at the NIEHS. An important consideration 
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene- 
expression database are currently under way [44,45]. 
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression.

Alleles, Oligo Arrays, and Toxicogenetics

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports demonstrated
the feasibility of this approach [41,42].
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility. 

FUTURE PRIORITIES 

There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human 
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in develop- 
ing the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 

The proliferation of different platforms and meth- 
ods for microarray hybridizations will improve 
sample handling and data collection and analysis and 
reduce costs. However, the variety of microarray 
methods available will create problems of data
compatibility between platforms. In addition, the
near-infinite variety of experimental conditions under
which data will be collected by different laboratories
will make large-scale data analysis extremely difficult.
To help circumvent these future problems, a
set of standards to be included on all platforms
should be established. These standards would facilitate
data entry into the national database and serve
as reference points for cross-platform and
inter-laboratory data analysis.

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray 
hybridization will have a dramatic impact on toxicol- 
ogy research. In the future, the information gathered 
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and environmental 
health.

Abstract

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact 
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific 
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints, 
indicative of a drug's efficacy and potential toxicity are accessible. The integration into state-of-the-art toxicology of 
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into 
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of 
drug efficacy and safety in pre-clinical and clinical studies based on biologically relevant tissue and surrogate markers. 
© 2000 Elsevier Science Ireland Ltd. All rights reserved.

1. Introduction

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of 
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, resulting
in altered gene expression. Such gene expression
regulations account for both the
pharmacological action and the toxicity of a drug
and can be visualized by either global mRNA or 
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of 
action and its mechanism of toxicity. 

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist 
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify 
gene expression further down-stream, creating a 
snapshot of gene regulation closer to ultimate cell 
function control.

2. Global mRNA profiling

Expression data at the mRNA level can be
produced using a set of different technologies
such as DNA microarrays, reverse transcript
imaging, amplified fragment length polymorphism
(AFLP), serial analysis of gene expression
(SAGE) and others. Currently, DNA microarrays
are very popular and promise great potential.
On a typical array, each gene of interest is represented
either by a long DNA fragment (200-2400
bp), typically generated by polymerase chain reaction
(PCR) and spotted on a suitable substrate
using robotics (Schena et al., 1995; Shalon et al.,
1996), or by several short oligonucleotides (20-30
bp) synthesized directly onto a solid support using
photolabile nucleotide chemistry (Fodor et al.,
1991; Chee et al., 1996). From control and treated
tissues, total RNA or mRNA is isolated and
reverse transcribed in the presence of radioactive
or fluorescent labeled nucleotides, and the labeled
probes are then hybridized to the arrays. The
intensity of the array signal is measured for each
gene transcript by either autoradiography or laser
scanning confocal microscopy. The ratio between
the signals of control and treated samples reflects
the relative drug-induced change in transcript
abundance.

3. Global protein profiling

Global quantitative expression analysis at the 
protein level is currently restricted to the use of 
two-dimensional gel electrophoresis. This tech- 
nique combines separation of tissue proteins by 
isoelectric focusing in the first dimension and by 
sodium dodecyl sulfate slab gel electrophoresis- 
based molecular weight separation on the second, 
orthogonal dimension (Anderson et al., 1991). 
The product is a rectangular pattern of protein 
spots that are typically revealed by Coomassie 
Blue, silver or fluorescent staining (Fig. 2). 
Protein spots are identified by mass spectrometry 
following generation of peptide mass fingerprints 
(Mann et al., 1993) and sequence tags (Wilkins et 
al., 1996). Similar to the mRNA approach, the 
ratios between the optical densities of spots from
control and treated samples are compared to
search for treatment-related changes. 
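
The control-versus-treated comparison used at both the mRNA and the protein level reduces to a per-spot ratio. The sketch below is one common way to express it, as a log2 fold change with a cutoff; the floor value, the twofold threshold, and the spot identifiers are assumptions made for illustration and are not specified in the text.

```python
import math


def log2_fold_changes(control, treated, floor=1.0):
    """Return log2(treated / control) per spot.

    Signals below `floor` are raised to it so that a near-zero
    background reading cannot produce an extreme or undefined ratio.
    """
    changes = {}
    for spot in control:
        c = max(control[spot], floor)
        t = max(treated[spot], floor)
        changes[spot] = math.log2(t / c)
    return changes


def flag_changed(changes, threshold=1.0):
    """Spots whose signal changed at least 2**threshold-fold in either direction."""
    return sorted(s for s, v in changes.items() if abs(v) >= threshold)


# Hypothetical optical densities (or array intensities) for three spots.
control = {"spot_1": 100, "spot_2": 400, "spot_3": 250}
treated = {"spot_1": 400, "spot_2": 100, "spot_3": 260}
changes = log2_fold_changes(control, treated)
print(flag_changed(changes))  # ['spot_1', 'spot_2']
```

Working in log2 units makes a fourfold induction (+2) and a fourfold suppression (-2) symmetric, which a raw ratio does not.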



4. Expression data analysis 

Bioinformatics forms a key element required to 
organize, analyze and store expression data from 
either source, the mRNA or the protein level. The 
overall objective, once a mass of high-quality
quantitative expression data has been collected, is
to visualize complex patterns of gene expression
changes, to detect pathways and sets of genes
tightly correlated with treatment efficacy and toxicity,
and to compare the effects of different sets of
treatment (Anderson et al., 1996). As the drug
effect database is growing, one may detect similarities
and differences between the molecular fingerprints
produced by various drugs, information
that may be crucial to make a decision whether to
refocus or extend the therapeutic spectrum of a
drug candidate.

Fig. 1. Production of an active protein is a multistep process in which numerous regulation systems exert control at various stages
of expression. Molecular fingerprints of drugs can be visualized through expression profiling at the mRNA level (genomics) using
a variety of technologies and at the protein level (proteomics) using two-dimensional gel electrophoresis.

Fig. 2. Computerized representation of a Coomassie Blue stained two-dimensional gel electrophoresis pattern of Fischer F344 rat
liver homogenate.



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data 
obtained by mRNA and protein expression analy- 
sis. Low abundant transcripts may not be easily 
quantified at the protein level using standard two- 
dimensional gel electrophoresis analysis and their 
detection may require prefractionation of sam- 
ples. The expression of such genes may be prefer- 
ably quantified at the mRNA level using 
techniques allowing PCR-mediated target amplification.
Tissue biopsy samples typically yield good
quality of both mRNA and proteins; however, the 
quality of mRNA isolated from body fluids is 
often poor due to the faster degradation of 
mRNA when compared with proteins. RNA sam- 
ples from body fluids such as serum or urine are 
often not very 'meaningful', and secreted proteins 
are likely more reliable surrogate markers for 
treatment efficacy and safety. Detection of post- 
translational modifications, events often related to 
function or nonfunction of a protein, is restricted 
to protein expression analysis and rarely can be 
predicted by mRNA profiling. Information on 
subcellular localization and translocation of 
proteins has to be acquired at the level of the 
protein in combination with sample prefractiona- 
tion procedures. The growing evidence of a poor 
correlation between mRNA and protein abun- 
dance (Anderson and Seilhamer, 1997) further 
suggests that the two approaches, mRNA and 
protein profiling, are complementary and should 
be applied in parallel. 

6. Expression profiling and drug development 

Understanding the mechanisms of action and 
toxicity, and being able to monitor treatment 
efficacy and safety during trials is crucial for the 
successful development of a drug. Mechanistic 
insights are essential for the interpretation of drug 
effects and enhance the chances of recognizing 
potential species specificities contributing to an 
improved risk profile in humans (Richardson et 
al., 1993; Steiner et al., 1996b; Aicher et al., 1998). 
The value of expression profiling further increases 
when links between treatment-induced expression 
profiles and specific pharmacological and toxic 
endpoints are established (Anderson et al., 1991, 
1995, 1996; Steiner et al. 1996a). Changes in gene 
expression are known to precede the manifesta- 
tion of morphological alterations, giving expres- 
sion profiling a great potential for early 
compound screening, enabling one to select drug 
candidates with wide therapeutic windows 
reflected by molecular fingerprints indicative of 
high pharmacological potency and low toxicity 
(Arce et al., 1998). In later phases of drug development,
surrogate markers of treatment efficacy
and toxicity can be applied to optimize the moni- 
toring of pre-clinical and clinical studies (Doherty 
et al., 1998). 



7. Perspectives 

The basic methodology of safety evaluation has 
changed little during the past decades. Toxicity in 
laboratory animals has been evaluated primarily 
by using hematological, clinical chemistry and 
histological parameters as indicators of organ 
damage. The rapid progress in genomics and pro- 
teomics technologies creates a unique opportunity 
to dramatically improve the predictive power of 
safety assessment and to accelerate the drug devel- 
opment process. Application of gene and protein 
expression profiling promises to improve lead se- 
lection, resulting in the development of drug can- 
didates with higher efficacy and lower toxicity. 
The identification of biologically relevant surro- 
gate markers correlated with treatment efficacy 
and safety bears a great potential to optimize the 
monitoring of pre-clinical and clinical trials.

DNA array technology makes it possible to rapidly genotype individuals or quantify the expression
of thousands of genes on a single filter or glass slide, and holds enormous potential in toxicologic
applications. This potential led to a U.S. Environmental Protection Agency-sponsored workshop
titled "Application of Microarrays to Toxicology" on 7-8 January 1999 in Research Triangle Park,
North Carolina. In addition to providing state-of-the-art information on the application of DNA or
gene microarrays, the workshop catalyzed the formation of several collaborations, committees, and
user's groups throughout the Research Triangle Park area and beyond. Potential applications of
microarrays to toxicologic research and risk assessment include genome-wide expression analyses to
identify gene-expression networks and toxicant-specific signatures that can be used to define mode
of action, for exposure assessment, and for environmental monitoring. Arrays may also prove useful
for monitoring genetic variability and its relationship to toxicant susceptibility in human populations.
Key words: DNA arrays, gene arrays, microarrays, toxicology. Environ Health Perspect
107:681-685 (1999). [Online 6 July 1999]



Decoding the genetic blueprint is a dream that 
offers manifold returns in terms of understand- 
ing how organisms develop and function in an 
often hostile environment. With the rapid 
advances in molecular biology over the last 30 
years, the dream has come a step closer to reali- 
ty. Molecular biologists now have the ability to 
elucidate the composition of any genome. 
Indeed, almost 20 genomes have already been 
sequenced and more than 60 are currently
under way. Foremost among these is the 
Human Genome Mapping Project. However, 
the genomes of a number of commonly used 
laboratory species are also under intensive 
investigation, including yeast, Arabidopsis, 
maize, rice, zebra fish, mouse, rat, and dog. It 
is widely expected that the completion of such 
programs will facilitate the development of 
many powerful new techniques and approach- 
es to diagnosing and treating genetically and 
environmentally induced diseases which afflict 
mankind. However, the vast amount of data 
being generated by genome mapping will 
require new high-throughput technologies to 
investigate the function of the millions of new 
genes that are being reported. Among the most 
widely heralded of the new functional 
genomics technologies are DNA arrays, which 
represent perhaps the most anticipated new 
molecular biology technique since polymerase 
chain reaction (PCR). 

Arrays enable the study of literally thou- 
sands of genes in a single experiment. The 
potential importance of arrays is enormous and 
has been highlighted by the recent publication 
of an entire Nature Genetics supplement dedicated
to the technology (1). Despite this huge
surge of interest, DNA arrays are still little used
and largely unproven, as demonstrated by the
high ratio of review and press articles to actual
data papers. Even so, the potential they offer
has driven venture capitalists into a frenzy of
investment and many new companies are 
springing up to claim a share of this rapidly 
developing market. 

The U.S. Environmental Protection 
Agency (EPA) is interested in applying DNA 
array technology to ongoing toxicologic stud- 
ies. To learn more about the current state of 
the technology, the Reproductive Toxicology 
Division (RTD) of the National Health and 
Environmental Effects Research Laboratory 
(NHEERL; Research Triangle Park, NC) 
hosted a workshop on "Application of 
Microarrays to Toxicology" on 7-8 January 
1999 in Research Triangle Park, North 
Carolina. The workshop was organized by 
David Dix, Robert Kavlock, and John Rockett 
of the RTD/NHEERL. Twenty-two intra- 
mural and extramural scientists from govern- 
ment, academia, and industry shared informa- 
tion, data, and opinions on the current and 
future applications for this exciting new tech- 
nology. The workshop had more than 150 
attendees, including researchers, students, and 
administrators from the EPA, the National 
Institute of Environmental Health Sciences 
(NIEHS), and a number of other establish- 
ments from Research Triangle Park and 
beyond. Presentations ranged from the tech- 
nology behind array production through the 
sharing of actual experimental data and projec- 
tions on the future importance and applica- 
tions of arrays. The information contained in 
the workshop presentations should provide aid 
and insight into arrays in general and their 
application to toxicology in particular. 

Array Elements 

In the context of molecular biology, the word 
"array" is normally used to refer to a series of 
DNA or protein elements firmly attached in 



a regular pattern to some kind of supportive 
medium. DNA array is often used inter- 
changeably with gene array or microarray. 
Although not formally defined, microarray is 
generally used to describe the higher density 
arrays typically printed on glass chips. The 
DNA elements that make up DNA arrays 
can be oligonucleotides, partial gene 
sequences, or full-length cDNAs. Companies 
offering premade arrays that contain less
than full-length clones normally use regions 
of the genes which are specific to that gene to 
prevent false positives arising through cross- 
hybridization. Sequence verification of 
cDNA clone identity is necessary because of 
errors in identifying specific clones from 
cDNA libraries and databases. Premade
DNA arrays printed on membranes are currently
or imminently available for human,
mouse, and rat. In most cases they contain 
DNA sequences representing several thou- 
sand different sequence clusters or genes as 
delineated through the National Center for 
Biotechnology Information UniGene Project
(2). Many of these different UniGene clusters 
(putative genes) are represented only by 
expressed sequence tags (ESTs). 

Array Printing 

Arrays are typically printed on one of two 
types of support matrix. Nylon membranes 
are used by most off-the-shelf array providers 
such as Clontech Laboratories, Inc. 
(Palo Alto, CA), Genome Systems, Inc. (St. 
Louis, MO), and Research Genetics, Inc. 
(Huntsville, AL). Microarrays such as those 
produced by Affymetrix, Inc. (Santa Clara, 
CA), Incyte Pharmaceuticals, Inc (Palo Alto, 
CA), and many do-it-yourself (DIY) arraying 
groups use glass wafers or slides. Although 
standard microscope slides may be used, they 
must be preprepared to facilitate sticking 
of the DNA to the glass. Several different
coatings have been successfully used, including
silane and lysine. The coating of slides
can easily be carried out in the laboratory, 
but many prefer the convenience of precoated 
slides available from suppliers. 

Once the support matrix has been pre- 
pared, the DNA elements can be applied by 
several methods. Affymetrix, Inc., has developed
a unique photolithographic technology
for attaching oligonucleotides to glass wafers. 
More commonly, DNA is applied by either 
noncontact or contact printing. Noncontact 
printers can use thermal, solenoid, or piezoelec- 
tric technology to spray aliquots of solution 
onto the support matrix and may be used to 
produce slide or membrane-based arrays. 
Cartesian Technologies, Inc (Irvine, CA) has 
developed nQUAD technology for use in its 
PixSys printers. The system couples a syringe 
pump with the microsolenoid valve, a combination
that provides rapid quantitative dispensing
of nanoliter volumes (down to 4.2 nL) over
a variable volume range. A different approach 
to noncontact printing uses a solid pin and ring 
combination (Genetic MicroSystems, Inc., 
Woburn, MA). This system (Figure 1) allows a
broader range of sample, including cell suspen- 
sions and particulates, because the printing 
head cannot be blocked up in the same way as 
a spray nozzle. Fluid transfer is controlled in 
this system primarily by the pin dimensions 
and the force of deposition, although the 
nature of the support matrix and the sample 
will also affect transfer to some degree. 

In contact printing, the pin head is dipped 
in the sample and then touched to the support 
matrix to deposit a small aliquot. Split pins 
were one of the first contact-printing devices
to be reported and are the suggested format
for DIY arrayers, as described by Brown (3).
Split pins are small metal pins with a precise
groove cut vertically in the middle of the pin
tip. In this system, 1-48 split pins are posi- 
tioned in the pin-head. The split pins work by 
simple capillary action, not unlike a fountain 
pen— when the pin heads are dipped in the 
sample, liquid is drawn into the pin groove. A 
small (fixed) volume is then deposited each 
time the split pins are gently touched to 
the support matrix. Sample (100-500 pL 
depending on a variety of parameters) can be 
deposited on multiple slides before refilling is 
required, and array densities of > 2,500
spots/cm² may be produced. The deposit volume
depends on the split size, sample fluidity,
and the speed of printing. Split pins are
relatively simple to produce and can be made 
in-house if a suitable machine shop is avail- 
able. Alternatively, they can be obtained 
directly from companies such as TeleChem 
International, Inc (Sunnyvale, CA). 

Irrespective of their source, printers 
should be run through a preprint sequence 
prior to producing the actual experimental 



arrays; the first 100 or so spots of a new run 
tend to be somewhat variable. Factors affecting
spot reproducibility include slide treatment
homogeneity, sample differences, and
instrument errors. Other factors that come 
into play include clean ejection of the drop 
and clogging (nQUAD printing) and 
mechanical variations and long-term alter- 
ation in print-head surface of solid and split 
pins. However, with careful preparation it is 
possible to get a coefficient of variation for
spot reproducibility below 10%. 
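
As an illustration of the reproducibility figure quoted above, the following minimal Python sketch computes a coefficient of variation (sample standard deviation divided by the mean, as a percentage) for a set of replicate spots. The function name and the intensity values are hypothetical and are not taken from the workshop data.

```python
from statistics import mean, stdev


def coefficient_of_variation(intensities):
    """Return the sample coefficient of variation (stdev / mean) as a percentage."""
    if len(intensities) < 2:
        raise ValueError("need at least two replicate spots")
    m = mean(intensities)
    if m == 0:
        raise ValueError("mean intensity is zero; CV is undefined")
    return 100.0 * stdev(intensities) / m


# Hypothetical intensities for replicate spots of one sample (arbitrary units).
replicate_spots = [1020.0, 980.0, 1005.0, 995.0, 1000.0]
cv_percent = coefficient_of_variation(replicate_spots)
print(f"CV = {cv_percent:.2f}%")
```

A print run would meet the target mentioned in the text if the value reported for its replicate spots stays below 10.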

One potential printing problem is sample 
carryover. Repeated washing, blotting, and 
drying (vacuum) of print pins between samples 
is normally effective at reducing sample carry- 
over to negligible amounts. Printing should 
also be carried out in a controlled environ- 
ment. Humidified chambers are available in 
which to place printers. These help prevent 
dust contamination and produce a uniform 
drying rate, which is important in determining
spot size, quality, and reproducibility. 

In summary, although several printing 
technologies are available, none are par- 
ticularly outstanding and the bottom line 
is that they are still in a relatively early stage 
of evolution. 

Array Hybridization 

The hybridization protocol is, practically 
speaking, relatively straightforward and those 
with previous experience in blotting should 
have little difficulty. Array hybridizations 
are, in essence, reverse Southern/Northern 
blots — instead of applying a labeled probe to 
the target population of DNA/RNA, the 
labeled population is applied to the probe(s).
With membrane-based arrays, the control and
treated mRNA populations are normally converted
to cDNA and labeled with isotope (e.g.,
33P) in the process. These labeled populations
are then hybridized independently to parallel
or serial arrays and the hybridization signal is
detected with a phosphorimager. A less commonly
used alternative to radioactive probes is
enzymatic detection. The probe may be 
biotinylated, haptenylated, or have alkaline 
phosphatase/horseradish peroxidase attached.
Hybridization is detected by an enzymatic reaction
that yields a colored product (4). Differences
in hybridization signals can be detected by eye 
or, more accurately, with the help of digital 
imaging and commercially available software. 
The labeling of the test populations for slide- 
based microarrays uses a slightly different 
approach. The probe typically consists of two 
samples of poly(A)+ RNA (usually from a treated
and a control population) that are converted to 
cDNA; in the process each is labeled with a 
different fluor. The independently labeled 
probes are then mixed together and hybridized 
to a single microarray slide and the resulting 
combined fluorescent signal is scanned. After 








Figure 1. Genetic Microsystems (Woburn, MA) pin
ring system for printing arrays. The pin ring com- 
bination consists of a circular open ring oriented 
parallel to the sample solution, with a vertical pin 
centered over the ring. When the ring is dipped 
into a solution and lifted, it withdraws an aliquot 
of sample held by surface tension. To spot the 
sample, the pin is driven down through the ring 
and a portion of the solution is transferred to the 
bottom of the pin. The pin continues to move 
downward until the pendant drop of solution 
makes contact with the underlying surface. The 
pin is then lifted, and gravity and surface tension 
cause deposition of the spot onto the array. 
Figure from Flowers et al. (14), with permission
from Genetic Microsystems. 

normalization, it is possible to determine the 
ratio of fluorescent signals from a single 
hybridization of a slide-based microarray. 
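
The ratio calculation described above can be sketched as follows. This is a minimal example and not the procedure used by any speaker: it assumes global median centering of the log ratios (that is, that most spotted genes do not change between samples), and the function name and intensity values are hypothetical.

```python
from math import log2
from statistics import median


def normalized_log_ratios(cy5, cy3):
    """Return log2(Cy5/Cy3) for each spot after global median centering."""
    if len(cy5) != len(cy3):
        raise ValueError("both channels must cover the same spots")
    raw = [log2(r / g) for r, g in zip(cy5, cy3)]
    # Assumption: the median spot is unchanged, so its log ratio is set to 0.
    center = median(raw)
    return [x - center for x in raw]


# Hypothetical background-subtracted intensities for five spots.
treated = [200.0, 400.0, 800.0, 1600.0, 400.0]  # Cy5 channel
control = [100.0, 200.0, 400.0, 200.0, 800.0]   # Cy3 channel
ratios = normalized_log_ratios(treated, control)
print(ratios)
```

In this made-up example the treated channel is uniformly twice as bright for three of the five spots, which the centering step treats as a labeling or scanning difference rather than a biological change.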

cDNA derived from control and treated 
populations of RNA is most commonly 
hybridized to arrays, although subtractive
hybridization or differential display reactions
may also be used. Fluorophore- or radiolabeled
nucleotides are directly incorporated
into the cDNA in the process of converting 
RNA to cDNA. Alternatively, 5' end-labeled 
primers may be used for cDNA synthesis. 
These are labeled with a fluorophore for 
direct visualization of the hybridized array. 
Alternatively, biotin or a hapten may be 
attached to the primer, in which case fluor- 
labeled streptavidin or antibody must be 
applied before a signal can be generated. The 
most commonly used fluorophores at present 
are cyanine (Cy)3 and Cy5 (Amersham 
Pharmacia Biotech AB, Uppsala, Sweden). 
However, the relative expense of these fluo- 
rescent conjugates has driven a search for 
cheaper alternatives. Fluorescein, rhodamine,
and Texas red have all been used, and 
companies such as Molecular Probes, Inc 
(Eugene, OR) are developing a series of 
labeled nucleotides with a wide range of exci- 
tation and emission spectra which may prove 
to function as well as the Cy dyes.

Table 1. Advantages and disadvantages of different microarray scanning systems.

Charge-coupled device (CCD) camera system
  Advantages: few moving parts; fast scanning of bright samples
  Disadvantages: less appropriate for dim samples; optical scatter can limit performance

Nonconfocal laser scanner
  Advantages: relatively simple optics
  Disadvantages: low light collection efficiency; background artifacts not rejected; resolution typically low

Confocal laser scanner
  Advantages: small depth of focus reduces artifacts; may have high light collection efficiency
  Disadvantages: small depth of focus requires scanning precision

Analysis of DNA Microarrays 

Membrane-based arrays are normally analyzed 
on film or with a phosphorimager, whereas 
chip-based arrays require more specialized scan- 
ning devices. These can be divided into three 
main groups: the charge-coupled device camera
systems, the nonconfocal laser scanners, and the
confocal laser scanners. The advantages and disadvantages
of each system are listed in Table 1.

Because a typical spot on a microarray can 
contain > 10^8 molecules, it is clear that a large
variation in signal strength may occur. 
Current scanners cannot work across this 
many orders of magnitude (4 or 5 is more typ- 
ical). However, the scanning parameters can 
normally be adjusted to collect more or less 
signal, such that two or three scans of the same 
array should permit the detection of rare and 
abundant genes. 
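
One way to combine two scans of the same array is sketched below. This is an illustrative approach only, not a method described at the workshop: it assumes a 16-bit detector ceiling, estimates the gain between scans from spots that are usable in both, and substitutes rescaled low-gain readings wherever the high-gain scan is saturated. The function name, the floor value, and the data are all hypothetical.

```python
from statistics import median

SATURATION = 65535  # assumed ceiling of a 16-bit detector


def merge_scans(low_gain, high_gain, saturation=SATURATION, floor=200):
    """Merge two scans of one array onto the high-gain intensity scale."""
    # Estimate the gain from spots that are above the noise floor in the
    # low-gain scan and below saturation in the high-gain scan.
    usable = [h / l for l, h in zip(low_gain, high_gain)
              if l > floor and h < saturation]
    if not usable:
        raise ValueError("no spots usable in both scans")
    gain = median(usable)
    # Keep high-gain readings unless saturated; otherwise rescale low-gain.
    return [h if h < saturation else l * gain
            for l, h in zip(low_gain, high_gain)]


# Hypothetical readings for four spots, from a rare to an abundant message.
low = [5, 300, 1000, 20000]
high = [50, 3000, 10000, 65535]
merged = merge_scans(low, high)
print(merged)
```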

When a microarray is scanned, the fluores- 
cent images are captured by software normally 
included with the scanner. Several commercial 
suppliers provide additional software for quan- 
tifying array images, but the software tools are 
constantly evolving to meet the developing
needs of researchers, and it is prudent to 
define one's own needs and clarify the exact 
capabilities of the software before its purchase. 
Issues that should be considered include the 
following: 

• Can the software locate offset spots?

• Can it quantitate across irregular hybridization
signals?

• Can the arrayed genes be programmed in for
easy identification and location?

• Can the software connect via the Internet to
databases containing further information on
the gene(s) of interest?

One of the key issues raised at the work- 
shop was the sensitivity of microarray technol- 
ogy. Experiments by General Scanning, Inc. 
(Watertown, MA), have shown that by using 
the Cy dyes and their scanner, signal can be 
detected down to levels of < 1 fluor molecule 
per square micrometer, which translates to 
detecting a rare message at approximately one 
copy per cell or less. 

Array Applications 

Although arrays are an emerging technology 
certain to undergo improvement and 
alteration, they have already been applied usefully
to a number of model systems. Arrays are
at their most powerful when they contain the 
entire genome of the species they are being 
used to study. For this reason, they have strong 
support among researchers utilizing yeast and 
Caenorhabditis elegans (5). The genomes of
both of these species have been sequenced and, 
in the case of yeast, deposited onto arrays for 
examination of gene expression (6,7). With 
both of these species, it is relatively easy to 
perturb individual gene expression. Indeed, C.
elegans knockouts can be made simply by
soaking the worms in an antisense solution of
the gene to be knocked out.

Table 1 note: CCD, charge-coupled device. From Kawasaki (13).

By a process of systematic gene disrup- 
tion, it is now possible to examine the cause 
and effect relationships between different 
genes in these simple organisms. This kind of 
approach should help elucidate biochemical 
pathways and genetic control processes, 
deconvolute polygenic interactions, and
define the architecture of the cellular network. 
A simple case study of how this can be 
achieved was presented by Butow [University 
of Texas Southwestern Medical Center, 
Dallas, TX (Figure 2)]. Although it is the 
phenotypic result of a single gene knockout 
that is being examined, the effect of such 
perturbation will almost always be polygenic. 
Polygenic interactions will become increasingly
important as researchers begin to move
away from single gene systems when examin- 
ing the nature of toxicologic responses to 
external stimuli. This is especially important 
in toxicology because the phenotype pro- 
duced by a given environmental insult is 
never the result of the action of a single gene; 
rather, it is a complex interaction of one or 
multiple cellular pathways. Phenomena such 
as quantitative trait (the continuous variation 
of phenotype), epistasis (the effect of alleles of 
one or more genes on the expression of other 
genes), and penetrance (proportion of indi- 
viduals of a given genotype that display a par- 
ticular phenotype) will become increasingly 
evident and important as toxicologists push 
toward the ultimate goal of matching the 
responses of individuals to different 
environmental stimuli. 

Analysis of the transcriptome (the expres- 
sion level of all the genes in a given cell popula- 
tion) was a use of arrays addressed by several 
speakers. Unfortunately, current gene nomenclature
is often confusing in that single genes
are allocated multiple names (usually as a result 
of independent discovery by different laborato- 
ries), and there was a call for standardization of 
gene nomenclature. Nevertheless, once a tran- 
scriptome has been assembled it can then be 
transferred onto arrays and used to screen any 
chosen system. The EPA MicroArray 
Consortium (EPAMAC) is assembling testes



transcriptomes for human, rat, and mouse. In a 
slightly different approach, Nuwaysir et al. (8) 
describe how the NIEHS assembled what is
effectively a "toxicological transcriptome" — a 
library of human and mouse genes that have 
previously been proven or implicated in 
responses to toxicologic insults. Clontech
Laboratories, Inc. (Palo Alto, CA), has begun a
similar process by developing stress/toxicology
filter arrays of rat, mouse, and human genes.
Thus, rather than being tissue or cell specific,
these stress/toxicology arrays can be used across
a variety of model systems to look for alter- 
ations in the expression of toxicologically 
important genes and define the new field of 
toxicogenomics. The potential to identify toxi- 
cant families based on tissue- or cell-specific 
gene expression could revolutionize drug test- 
ing. These molecular signatures or fingerprints 
could not only point to the possible 
toxicity/carcinogenicity of newly discovered 
compounds (Figure 3), but also aid in elucidating
their mechanism of action through identification
of gene expression networks. By extension,
such signatures could provide easily identifiable
biomarkers to assess the degree, time,
and nature of exposure. 
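
The idea of matching a test compound against known toxicant signatures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: similarity is measured by Pearson correlation of log expression ratios, a match requires r of at least 0.9, and all family names other than the peroxisome proliferators, all function names, and all values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_match(profile, signatures, threshold=0.9):
    """Return (family, r) for the best signature, or (None, r) if r < threshold."""
    family, r = max(
        ((name, pearson(profile, sig)) for name, sig in signatures.items()),
        key=lambda item: item[1],
    )
    return (family if r >= threshold else None, r)


# Hypothetical log2 expression ratios for five marker genes.
known = {
    "peroxisome proliferator": [2.0, 1.5, -1.0, 0.0, 3.0],
    "hypothetical family B": [-1.0, 0.5, 2.0, 1.0, -2.0],
}
compound_1 = [2.1, 1.4, -0.9, 0.1, 2.8]
compound_2 = [0.0, 1.0, 0.0, -1.0, 0.0]
match_1 = best_match(compound_1, known)
match_2 = best_match(compound_2, known)
print(match_1, match_2)
```

Here compound 1 tracks the peroxisome proliferator profile closely, while compound 2 matches neither family at the chosen threshold.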

DNA arrays are primarily a tool for exam- 
ining differential gene expression in a given 
model. In this context they are referred to as 
closed systems because they lack the ability of 
other differential expression technologies, e.g.,
differential display and subtractive hybridiza- 
tion, to detect previously unknown genes not 
present on the array. This would appear to 
limit the power of DNA arrays to the imagina- 
tions and preconceptions of the researcher in 
selecting genes previously characterized and 
thought to be involved in the model system. 
However, the various genome sequencing pro- 
jects have created a new category of 
sequence, the EST, that has partially mollified
this deficiency. ESTs are cDNAs expressed
in a given tissue that, although they may share 
some degree of sequence similarity to previous- 
ly characterized genes, have not been assigned 
specific genetic identity. By incorporating EST 
clones into an array, it is possible to monitor 
the expression of these unknown genes. This 
can enable the identification of previously
uncharacterized genes that may have biologic
significance in the model system. Filter arrays
from Research Genetics and slide arrays from 
Incyte Pharmaceuticals both incorporate large 
numbers of ESTs from a variety of species. 

A further use of microarrays is the identifi- 
cation of single nucleotide polymorphisms 
(SNPs). These genomic variations are abun- 
dant — they occur approximately every 1 kb or 
so— and are the basis of restriction fragment 
length polymorphism analysis used in forensic 
analysis. Affymetrix, Inc., designed chips that 
contain multiple repeats of the same gene 
sequence. Each position is present with all four 
possible bases. After the hybridization of the 
sample, the degree of hybridization to the dif- 
ferent sequences can be measured and the exact 
sequence of the target gene deduced. SNPs are
thought to be of vital importance in drug 
metabolism and toxicology. For example, sin- 
gle base differences in the regulatory region or 
active site of some genes can account for huge 
differences in the activity of that gene. Such
SNPs are thought to explain why some people 
are able to metabolize certain xenobiotics bet- 
ter than others. Thus, arrays provide a further 
tool for the toxicologist investigating the 
nature of susceptible subpopulations and toxi- 
cologic response. 
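
The sequence deduction step described above can be illustrated with a short sketch. It assumes one probe per possible base at each position and calls the base with the strongest signal, reporting "N" when the top two signals are too close to separate. The function name, the twofold cutoff, and the intensities are hypothetical and do not represent the vendor's actual algorithm.

```python
def call_bases(intensities, min_ratio=2.0):
    """Call one base per position from four probe signals.

    `intensities` is a list of dicts mapping base -> hybridization signal.
    A position is reported as "N" when the strongest signal is not at least
    `min_ratio` times the runner-up (for example, a possible heterozygote).
    """
    calls = []
    for position in intensities:
        ranked = sorted(position.items(), key=lambda kv: kv[1], reverse=True)
        (best_base, best), (_, second) = ranked[0], ranked[1]
        if best <= 0 or (second > 0 and best / second < min_ratio):
            calls.append("N")
        else:
            calls.append(best_base)
    return "".join(calls)


# Hypothetical signals for four consecutive positions.
signals = [
    {"A": 950, "C": 40, "G": 55, "T": 30},
    {"A": 35, "C": 60, "G": 880, "T": 45},
    {"A": 410, "C": 50, "G": 390, "T": 40},  # ambiguous position
    {"A": 20, "C": 30, "G": 25, "T": 900},
]
sequence = call_bases(signals)
print(sequence)
```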

There are still many wrinkles to be ironed
out before arrays become a standard tool for 
toxicologists. The main issues raised at the 
workshop by those with hands-on experience 
were the following: 

• Expense: the cost of purchasing/contracting
this technology is still too great for many
individual laboratories.




Figure 2. Potential effects of gene knockout within 
positively and negatively regulated gene expression 
networks. /, is limiting in wild type for expression of 
if (A) A simple, two-component linear regulatory
network operating on gene ^ where /, is a positive 
effector of ^ and j n is either a positive or negative 
effector of i y This network could be deduced by 
examining the consequence of (B) deleting j n on the
expression of i\ and ^ where the expression of ^ 
would be decreased or increased depending on 
whether j n was a positive or negative regulator. 
These and other connected components of even 
greater complexity could be revealed by genome-wide
expression analysis. From Butow (15).



• Clones: the logistics of identifying, obtaining,
and maintaining a set of nonredundant, noncontaminated,
sequence-verified, species/cell/tissue/field-specific clones.

• Use of inbred strains: where whole-organism
models are being used, the use of inbred
strains is important to reduce the potentially
confusing effects of the individual variation
typically seen in outbred populations.

• Probe: the need for relatively large amounts
of RNA, which limits the type of sample
(e.g., biopsy) that can be used. Also, different
RNA extraction methods can give different
results.

• Specificity: the ability to discriminate accurately
between closely related genes (e.g., the
cytochrome P450 family) and splice variants.

• Quantitation: the quantitation of gene
expression using gene arrays is still open to
debate. One reason for this is the different
incorporation of the labeling dyes. However,
the main difficulty lies in knowing what to
normalize against. One option is to include a
large number of so-called housekeeping genes
in the array. However, the expression of these
genes often changes depending on the tissue
and the toxicant, so it is necessary to characterize
the expression of these genes in the
model system before utilizing them. This is
clearly not a viable option when screening
multiple new compounds. A second option
is to include on the array genes from a nonrelated
species (e.g., a plant gene on an animal
array) and to spike the probe with synthetic
RNA(s) complementary to the gene(s).

• Reproducibility: this is sometimes questionable,
and a figure of approximately two or
three repeats was used as the minimum number
required to confirm initial findings.
Again, however, most people advocated the
use of Northern blots or reverse transcriptase
PCR to confirm findings.

• Sensitivity: concerns were voiced about the 
number of target molecules that must be pre- 
sent in a sample for them to be detected on 
the array. 

• Efficiency: reproducible identification of
1.5- to 2-fold differences in expression was reported,
although the number of genes that
undergo this level of change and remain
undetected is open to debate. It is important
that this level of detection be ultimately
achieved because it is commonly perceived
that some important transcription factors
and their regulators respond at such low levels.
In most cases, 3- to 5-fold was the minimum
change that most were happy to
accept.

• Bioinformatics: perhaps the greatest concern
was how to interpret the data with
the greatest accuracy and efficiency. The
biggest headache is trying to identify networks
of gene expression that are common to
different treatments or doses. The amount of
data from a single experiment is huge. It may
be that, in the future, several groups individually
equipped with specialized software algorithms
for studying their favorite genes or
gene systems will be able to share the same
hybridized chips. Thus, arrays could usher in
a new perspective on collaboration and the
sharing of data.
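
Two of the issues listed above, normalization against spiked control RNA and the minimum fold change accepted as real, can be combined in a short sketch. This is a hedged illustration, not a recommended protocol: it assumes that equal amounts of spike RNA were added to both samples, uses the median spike ratio as the scale factor, and applies the 3-fold minimum mentioned in the list. All names and values are hypothetical.

```python
from statistics import median


def spike_scale_factor(control_spikes, treated_spikes):
    """Factor that rescales treated signals so the spiked controls agree."""
    return median(c / t for c, t in zip(control_spikes, treated_spikes))


def changed_genes(control, treated, scale, min_fold=3.0):
    """Return {gene: fold} for genes changed at least min_fold up or down."""
    hits = {}
    for gene, control_signal in control.items():
        fold = treated[gene] * scale / control_signal
        if fold >= min_fold or fold <= 1.0 / min_fold:
            hits[gene] = fold
    return hits


# Hypothetical signals for three spiked plant controls on an animal array.
control_spikes = [500.0, 1000.0, 2000.0]
treated_spikes = [250.0, 500.0, 1000.0]
scale = spike_scale_factor(control_spikes, treated_spikes)

# Hypothetical signals for three genes of interest.
control = {"geneA": 100.0, "geneB": 400.0, "geneC": 900.0}
treated = {"geneA": 200.0, "geneB": 210.0, "geneC": 100.0}
hits = changed_genes(control, treated, scale)
print(scale, hits)
```

Without the spike-derived scale factor, geneA would appear to change only 2-fold and would fall below the 3-fold cutoff.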

EPAMAC 

Perhaps the main reason most scientists are 
unable to use array technology is the high cost 
involved, whether buying off-the-shelf mem- 
branes, using contract printing services, or 




Figure 3. Gene expression profiles (also called fingerprints or signatures) of known toxicants or toxicant
families may, in the future, be used to identify the potential toxicity of new drugs, etc. In this example,
the genetic signature of test compound 1 is identical to that of known peroxisome proliferators,
whereas that of test compound 2 does not match any known toxicant family. Based on these results, test
compound 2 would be retained for further testing and test compound 1 would be eliminated.

producing chips in-house. In view of this,
researchers at the RTD/NHEERL initiated 
the EPAMAC. This consortium brings 
together scientists from the EPA and a number
of extramural labs with the aim of developing
microarray capability through the sharing
of resources and data. EPAMAC
researchers are primarily interested in the 
developmental and toxicologic changes seen 
in testicular and breast tissue, and a portion 
of the workshop was set aside for EPAMAC 
members to share their ideas on how the 
experimental application of microarrays could 
facilitate their research. One of the central 
areas of interest to EPAMAC members is the 
effect of xenobiotics on male fertility and 
reproductive health. Of greatest concern is 
the effect of exposure during critical periods 
of development and germ cell differentiation 
(9), and how this may compromise sperm
counts and quality following sexual maturation
(10). As well as spermatogenic tissue,
there is also interest in how residual mRNA
found in mature sperm (11) could be used as
an indicator of previous xenobiotic effects (it
is easier to obtain a semen sample than a tes- 
ticular biopsy). Arrays will be used to examine 
and compare the effect of exposure to heat 
and chemicals in testicular and epididymal 
gene expression profiles, with the aim of 
establishing relationships/associations 
between changes in developmental landmarks 
and the effects on sperm count and quality. 
Cluster, pattern, and other analysis of such 
data should help identify hidden relationships 
between genes that may reveal potential 
mechanisms of action and uncover roles for 
genes with unknown functions. 
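
A very simple form of the cluster analysis mentioned above is sketched below. It is an assumption-laden illustration rather than the consortium's method: genes are grouped whenever their expression profiles across treatments have a Pearson correlation of at least 0.9, with groups merged transitively (single linkage). Gene names, the threshold, and the values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_clusters(profiles, threshold=0.9):
    """Group genes whose profiles correlate at or above the threshold."""
    parent = {gene: gene for gene in profiles}

    def find(gene):
        # Follow parent links to the group representative.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in combinations(profiles, 2):
        if pearson(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for gene in profiles:
        groups.setdefault(find(gene), []).append(gene)
    return list(groups.values())


# Hypothetical expression levels across four doses.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.1, 5.9, 8.0],
    "gene3": [4.0, 3.0, 2.0, 1.0],
    "gene4": [8.0, 6.2, 3.9, 2.0],
}
clusters = correlation_clusters(profiles)
print(clusters)
```

A gene of unknown function that lands in the same group as well-characterized genes becomes a candidate for a shared regulatory pathway, which is the kind of hidden relationship the text refers to.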

Summary 

The full impact of DNA arrays may not be 
seen for several years, but the interest shown at 
this regional workshop indicates the high level 
of interest that they foster. Apart from educating
and advertising the various technologies in
this field, this workshop brought together a 
number of researchers from the Research 
Triangle Park area who are already using DNA 
arrays. The interest in sharing ideas and experi- 
ences led to the initiation of a Triangle array 
users group. 



Array technology is still in its infancy. This 
means that the hardware is still improving and
there is no current consensus for standard procedures,
quantitation, and interpretation.
Consistency in setting and scanning arrays is
not yet optimized, and this is one of the most
critical requirements of any experiment. In
addition, one of the dark regions of array technology
(strife in the courts over who owns
what portions of it) has further muddled the
future and is a potential barrier toward the 
development of consensus procedures. 

Perhaps the greatest hurdle for the applica- 
tion of arrays is the actual interpretation of 
data. No specialists in bioinformatics attended 
the workshop, largely because they are rare and
because as yet no one seems clear on the best
method of approaching data analysis and interpretation.
Cross-referencing results from multiple
experiments (time, dose, repeats, different
animals, different species) to identify commonly
expressed genes is a great challenge. In most
cases, we are still a long way from understanding
how the expression of gene X is related to
the expression of gene Y, and ordering gene
expression to delineate causal relationships. 

To the ordinary scientist in the typical laboratory,
however, the most immediate problem
is a lack of affordable instrumentation.
One can purchase premade membranes at
relatively affordable prices. Although these
may be useful in identifying individual genes
to pursue in more detail using other methods,
the numbers that would be required for even a
small routine toxicology experiment prohibit
this as a truly viable approach. For the toxicologist,
there is a need to carry out multiple
experiments: dose responses, time curves,
multiple animals, and repeats. Glass-based
DNA arrays are most attractive in this context
because they can be prepared in large batches
from the same DNA source and accommodate
control and treated samples on the same
chip. Another problem with current off-the-shelf
arrays is that they often do not contain
one or more of the particular genes a group is
interested in. One alternative is to obtain
or produce a set of custom clones and have
contract printing of membranes or slides carried
out by a company such as Genomic
Solutions, Inc. (Ann Arbor, MI). This approach
is less expensive than laying out capital for
one's own entire system, although at some
point it might make economic sense to print
one's own arrays.

Finally, DNA arrays are currently a team 
effort. They are a technology that uses a wide 
range of skills including engineering, statistics, 
molecular biology, chemistry, and bioinformatics.
Because most individuals are skilled in
only one or perhaps two of these areas, it 
appears that success with arrays may be best 
expected by teams of collaborators consisting 
of individuals having each of these skills. 

Those considering array applications may 
be amused or goaded on by the following 
quote from Fortune magazine (12): 

Microprocessors have reshaped our economy,
spawned vast fortunes and changed the way we live. 
Gene chips could be even bigger. 

Although this comment may have been 
designed to excite the imagination rather than 
accurately reflect the truth, it is fair to say that
the age of functional genomics is upon us. 
DNA arrays look set to be an important tool in 
this new age of biotechnology and will likely 
contribute answers to some of toxicology's 
most fundamental questions.

Subject: RE: [Fwd: Toxicology Chip]
Date: Mon, 3 Jul 2000 08:09:45 -0400
From: "Afshari, Cynthia" <afshari@niehs.nih.gov>
To: "Diana Hamlet-Cox" <dianahc@incyte.com>

You can see the list of clones that we have on our 12K chip at
http://manuel.niehs.nih.gov/maps/guest/clonesrch.cfm

We selected a subset of genes (2000K) that we believed critical to tox
response and basic cellular processes and added a set of clones and ESTs to
this. We have included a set of control genes (80+) that were selected by
the NHGRI because they did not change across a large set of array
experiments. However, we have found that some of these genes change
significantly after tox treatments and are in the process of looking at the
variation of each of these 80+ genes across our experiments.
Our chips are constantly changing and being updated and we hope that our
data will lead us to what the toxchip should really be.
I hope this answers your question.

Cindy Afshari

> From: Diana Hamlet-Cox
> Sent: Monday, June 26, 2000 8:52 PM
> To: afshari@niehs.nih.gov
> Subject: [Fwd: Toxicology Chip]
>
> Dear Dr. Afshari,
>
> Since I have not yet had a response from Bill Grigg, perhaps he was not
> the right person to contact.
>
> Can you help me in this matter? I don't need to know the sequences,
> necessarily, but I would like very much to know what types of sequences
> are being used, e.g., GPCRs (more specific?), ion channels, etc.
>
> Diana Hamlet-Cox
>
> Original Message
> Subject: Toxicology Chip
> Date: Mon, 19 Jun 2000 18:31:48 -0700
> From: Diana Hamlet-Cox <dianahc@incyte.com>
> Organization: Incyte Pharmaceuticals
> To: grigg@niehs.nih.gov
>
> Dear Colleague:
>
> I am doing literature research on the use of expressed genes as
> pharmacotoxicology markers, and found the Press Release dated February
> 29, 2000 regarding the work of the NIEHS in this area. I would like to
> know if there is a resource I can access (or you could provide?) that
> would give me a list of the 12,000 genes that are on your Human ToxChip
> Microarray. In particular, I am interested in the criteria used to
> select sequences for the ToxChip, including any control sequences
> included in the microarray.
>
> Thank you for your assistance in this request.
>
> Diana Hamlet-Cox, Ph.D.
> Incyte Genomics. Inc. 
ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evalua-
tion tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores 
to evaluate matches rather than percentage identity or raw 
scores. The E-value statistical scores of SSEARCH and fasta are 
reliable: the number of false positives found in our tests agrees 
well with the scores reported. However, the P- values reported 
by BLAST and WU-BLAST2 exaggerate significance by orders of 
magnitude. SSEARCH, fasta ktup = 1, and WU-BLAST2 perform 
best, and they are capable of detecting almost all relationships 
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 

Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 

Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2) — which produces 
gapped alignments — has become available. The latest version 
of fasta (3) previously tested was 1.6, but the current release 
(version 3.0) provides fundamentally different results in the 
form of statistical scoring. 

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
SCOP: Structural Classification of Proteins database (4), which 
is derived from structural and functional characteristics (5). 
The SCOP database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith-
Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics 
have provided BLAST (1) the speed and convenience to make 
it the most popular program. Intermediate between these two 
is FASTA (3), which may be run in two modes offering either 
greater speed (ktup = 2) or greater effectiveness (ktup = 1). 
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR
superfamilies. Pearson found that modern matrices and "ln-
scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Abbreviation: EPQ, errors per query.

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of BLAST and fasta. Their test with BLAST 
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence com- 
parison, the correct results were effectively unknown. This is 
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but PIR places them in different superfamilies. 
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the HSSP equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the BLAST program using the 
Karlin and Altschul statistics (22, 23) and empirical ap- 
proaches have been recently added to fasta and ssearch. In 
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical trac- 
tability of statistical scores "is a crucial feature of the BLAST 
algorithm" (1). The validity of this scoring procedure has been 
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biolog- 
ical data to determine the degree to which such rankings are 
superior. 

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low. 

The recent growth of protein structure information com-
bined with the comprehensive evolutionary classification in
the SCOP database (4, 5) has allowed us to overcome previous
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The SCOP database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of 
proteins in the Protein Data Bank (PDB) (30) and created two 
databases. One (PDB90D-B) has domains, which were all <90% 
identical to any other, whereas (PDB40D-B) had those <40% 
identical. The databases were created by first sorting all 
protein domains in SCOP by their quality and making a list. The 
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relation- 
ships, representing 1.2% of all pairs. Low complexity regions 
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the SEG program 
(27) using recommended parameters: 12 1.8 2.0. The databases 
used in this paper are available from http://sss.stanford.edu/ 
sss/, and databases derived from the current version of SCOP 
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/. 
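
The selection procedure described above can be sketched as a greedy filter. This is only an illustration under stated assumptions: the function and parameter names (`select_nonredundant`, `quality`, `identity`) are hypothetical, the quality ranking and the pairwise identity measure are supplied by the caller, and domains at or above the threshold are treated as redundant. The helper `ordered_pairs` reproduces the pair counts quoted in the text.

```python
from typing import Callable, List, Sequence


def select_nonredundant(domains: Sequence[str],
                        quality: Callable[[str], float],
                        identity: Callable[[str, str], float],
                        threshold: float) -> List[str]:
    """Greedy selection: keep the best remaining domain, then discard
    every other domain whose identity to it reaches the threshold."""
    # Sort once by quality, best first (the "list" in the text).
    remaining = sorted(domains, key=quality, reverse=True)
    selected: List[str] = []
    while remaining:
        best = remaining.pop(0)          # highest quality domain still listed
        selected.append(best)
        # Assumption: "above the threshold" is implemented as >= threshold.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct sequences."""
    return n * (n - 1)
```

With 1,323 domains, `ordered_pairs(1323)` gives the 1,749,006 ordered pairs quoted for PDB40D-B, of which the 9,044 homologous pairs are about 0.5%.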

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the PDB of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences) 
improves evaluations of statistics. Except where noted other- 
wise, the distant homolog results here are from PDB40D-B. 
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se-
quence comparison may be divided into four different major
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and WU-
BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix (BLO-
SUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have
perfect separation, with all of the homologs at the top of the
list and unrelated proteins below. In practice, perfect separa-
tion is impossible to achieve so instead one is interested in
drawing a threshold above which there are the largest number
of related pairs of sequences consistent with an acceptable
error rate.

Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors correspond to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of Re-
ceiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of nonho-
mologs.
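
The definitions of coverage and EPQ above translate directly into a short computation. The following is a minimal sketch, not the authors' code: the names `coverage_vs_error` and `coverage_at_epq` are hypothetical, each hit is assumed to be a `(score, is_homolog)` pair in which a higher score is better (E-values would be negated first), and tied scores are not given any special treatment.

```python
from typing import Iterable, List, Tuple


def coverage_vs_error(hits: Iterable[Tuple[float, bool]],
                      total_homolog_pairs: int,
                      num_queries: int) -> List[Tuple[float, float, float]]:
    """Return (threshold, coverage, epq) for every possible score threshold.

    coverage = homologous pairs at or above the threshold / all homologous pairs
    epq      = nonhomologous pairs at or above the threshold / number of queries
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)  # best score first
    curve = []
    found = errors = 0
    for score, is_homolog in ranked:
        if is_homolog:
            found += 1
        else:
            errors += 1
        curve.append((score, found / total_homolog_pairs, errors / num_queries))
    return curve


def coverage_at_epq(curve: List[Tuple[float, float, float]],
                    max_epq: float) -> float:
    """Best coverage reachable without exceeding the given errors per query."""
    return max((cov for _, cov, epq in curve if epq <= max_epq), default=0.0)
```

Plotting the second element of each tuple on the x axis against the third on a logarithmic y axis gives a coverage vs. error plot of the kind described in the text.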

This assessment procedure is directly relevant to practical
sequence database searching, for it provides precisely the
information necessary to perform a reliable sequence database
search. The EPQ measure places a premium on score consis-
tency; that is, it requires scores to be comparable for different
queries. Consistency is an aspect which has been largely
ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by data-
base searching programs, if the programs' estimates are accu-
rate.

[Figure 2 image: Hemoglobin β-chain (1hdsb) and Cellulase E2 (1tml_), with their sequence alignment]

Fig. 2. Unrelated proteins with high percentage identity. Hemo-
globin β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RasMol (40).

[Figure 3 plot: Percent Identity of Unrelated Proteins (PDB90D-B). Each point plots the length and percent identity of an alignment between two unrelated proteins.]

Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B: Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).

[Figure 4 plot: Reliability of Statistical Scores (PDB90D-B)]

Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align-
ment and subtracting gap penalties. In BLAST, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30% 
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the HSSP equation improves the value of percentage 
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 
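
To make the three percentage identity measures concrete, the sketch below computes each of them, reading the HSSP equation in the Fig. 1 caption as H = 290.15 l^(-0.562) for 10 < l < 80 and H = 24.7 for l > 80. The function names are hypothetical. The caption leaves l = 10 and l = 80 undefined, so applying the power law at those two lengths is an assumption, as is representing "H > 100" for l < 10 by infinity.

```python
def hssp_threshold(length: int) -> float:
    """Length-dependent identity threshold H (percent) for an alignment."""
    if length < 10:
        return float("inf")   # caption: H > 100, so no identity level suffices
    if length > 80:
        return 24.7
    return 290.15 * length ** -0.562   # assumed to hold at l = 10 and l = 80 too


def identity_within_alignment(identical: int, alignment_length: int) -> float:
    """Percent identity over the aligned region only."""
    return 100.0 * identical / alignment_length


def identity_within_both(identical: int, query_length: int,
                         target_length: int) -> float:
    """Identical residues as a percentage of the mean sequence length."""
    return 100.0 * identical / ((query_length + target_length) / 2.0)


def hssp_adjusted(identical: int, alignment_length: int) -> float:
    """Percent identity within the alignment minus the HSSP threshold H."""
    return (identity_within_alignment(identical, alignment_length)
            - hssp_threshold(alignment_length))
```

For example, an alignment of 64 residues with 25 identities (a hypothetical count chosen to give about 39% identity, as in Fig. 2) has an HSSP-adjusted score above zero, which illustrates how an unrelated pair can clear a pure identity threshold.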

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no 
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 
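
A raw score is the sum of substitution scores over aligned positions minus gap penalties, and the sketch below shows that arithmetic on an alignment that has already been computed. It rests on several assumptions: the function name `raw_score` is hypothetical, the substitution function in the usage note is a toy (+5 match, -2 mismatch) rather than BLOSUM45, and the -12/-1 penalties quoted earlier are interpreted here as -12 for the first position of a gap and -1 for each further position, which may differ from the convention a given program uses.

```python
from typing import Callable


def raw_score(query_aln: str, target_aln: str,
              substitution: Callable[[str, str], int],
              gap_first: int = -12, gap_next: int = -1) -> int:
    """Score a gapped pairwise alignment given as two equal-length strings
    in which '-' marks a gap position."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for q, t in zip(query_aln, target_aln):
        if q == "-" or t == "-":
            # First gap position pays gap_first, later ones pay gap_next.
            # Simplification: adjacent gaps in opposite sequences count as one.
            score += gap_next if in_gap else gap_first
            in_gap = True
        else:
            score += substitution(q, t)
            in_gap = False
    return score
```

With the toy function `lambda a, b: 5 if a == b else -2`, changing either the matrix values or the gap parameters shifts every score, which is why a raw score cutoff cannot be carried over from one parameter set to another.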

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is
scaled appropriately.

[Figure 5 plots: Sequence Comparison Algorithms (PDB40D-B); Sequence Comparison Algorithms (PDB90D-B)]

Fig. 5. Coverage vs. error plots of different sequence comparison methods: Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.

We find that statistical scores are not only powerful, but also 
easy to interpret. SSEARCH and FASTA show close agreement 
between statistical scores and actual number of errors per 
query (Fig. 4). The expectation value score gives a good, 
slightly conservative estimate of the chances of the two se- 
quences being found at random in a given query. Thus, an 
E-value of 0.01 indicates that roughly one pair of nonhomologs 
of this similarity should be found in every 100 different queries. 
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 
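
The interpretation of an E-value as an expected number of errors per query can be written out as simple arithmetic. The sketch below is illustrative only, with hypothetical function names. The conversion from E-value to P-value uses the standard Poisson relation P = 1 - exp(-E), which is not stated in the text but is consistent with the remark in the Fig. 4 caption that P-values match EPQ for small numbers and diverge at higher values.

```python
import math


def expected_false_positives(e_value: float, num_queries: int) -> float:
    """Expected number of nonhomologous matches at or better than the
    E-value threshold across a number of independent queries."""
    return e_value * num_queries


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance match, assuming the number of
    chance matches is Poisson distributed with mean equal to the E-value."""
    return -math.expm1(-e_value)   # numerically stable form of 1 - exp(-E)
```

At a threshold of E = 0.01, `expected_false_positives(0.01, 100)` is about one error in 100 queries, matching the example in the text, and the corresponding P-value is nearly identical to the E-value. At E = 3 the P-value is about 0.95, since a probability cannot exceed 1.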

The P-values from BLAST also should be directly interpret- 
able but were found to overstate significance by more than two 
orders of magnitude for 1% EPQ for this database. Nonethe- 
less, these results strongly suggest that the analytic theory is 
fundamentally appropriate. WU-BLAST2 scores were more re- 
liable than those from blast, but also exaggerate expected 
confidence by more than an order of magnitude at 1% EPQ. 

Overall Detection of Homologs and Comparison of Algo- 
rithms. The results in Fig. 5A and Table 1 show that pairwise 
sequence comparison is capable of identifying only a small 
fraction of the homologous pairs of sequences in PDB40D-B. 
Even SSEARCH with E-values, the best protocol tested, could 
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect ho-
mologs. Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the 
best method can identify only 38% of structurally known 
homologs (Fig. 5B). The method which finds that many 
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with 
variation in database composition and scoring reliability. 

Fig. 6 helps to explain why most distant homologs cannot be 
found by sequence comparison: a great many such relation- 
ships have no more sequence identity than would be expected 
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there 
are 30 pairs of homologous proteins that do not have signif- 
icant E-values, but 26 of these involve sequences with <50 
residues. Of sequences having 25-30% identity, 75% are 
identified by ssearch E-values. However, although the num- 
ber of homologs grows at lower levels of identity, the detection 
falls off sharply: only 40% of homologs with 20-25% identity
are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged ex-
tremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

After completion of this work, a new version of pairwise 
BLAST was released: blastgp (37). It supports gapped align- 
ments, like WU-BLAST2, and dispenses with sum statistics. Our 
initial tests on blastgp using default parameters show that its 
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped blast, but not 
quite equal to that of WU-BLAST2. 

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se-
quence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view. 

Our results also suggest two further points. First, the E-val-
ues reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true
extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                  Relative Time*   1% EPQ Cutoff       Coverage at 1% EPQ
SSEARCH % identity: within alignment    25.5             >70%                <0.1
SSEARCH % identity: within both         25.5             34%                 3.0
SSEARCH % identity: HSSP-scaled         25.5             35% (HSSP + 9.8)    4.0
SSEARCH Smith-Waterman raw scores       25.5             142                 10.5
SSEARCH E-values                        25.5             0.03                18.4
FASTA ktup = 1 E-values                 3.9              0.03                17.9
FASTA ktup = 2 E-values                 1.4              0.03                16.7
WU-BLAST2 P-values                      1.1              0.003               17.5
BLAST P-values                          1.0              0.00016             14.8

*Times are from large database searches with genome proteins.

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 



** Additional and updated information about this work, including 
supplementary figures, may be found at http://sss.stanford.edu/sss/. 

Three-Dimensional Structure of
Myosin Subfragment-1:
A Molecular Motor

Directed movement is a characteristic of many living organisms and occurs as a result of
the transformation of chemical energy into mechanical energy. Myosin is one of three
families of molecular motors that are responsible for cellular motility. The three-dimensional
structure of the head portion of myosin, or subfragment-1, which contains both the actin
and nucleotide binding sites, is described. This structure of a molecular motor was
determined by single crystal x-ray diffraction. The data provide a structural framework for
understanding the molecular basis of motility.



Motility is one of the characteristic features of many living
organisms and involves the transduction of chemical into mechanical
energy. Only a limited number of strategies have evolved to
accomplish this task. At present, three major classes of molecular
motors have been identified, myosin, dynein, and kinesin, and all
are important in cellular movement (1). Of these three proteins, the
most abundant is myosin, which plays both a structural and an
enzymatic role in both muscle contraction and intracellular
motility.

The role of myosin in movement has been most clearly defined from
the study of cross-striated skeletal muscle, which shows a high
degree of structural organization. In striated muscle the basic
contractile unit is the sarcomere, which consists of overlapping
arrays of thick and thin filaments. During contraction, these
filaments, which are composed primarily of myosin and actin,
respectively, slide past one another, thereby shortening the length
of the sarcomere (2). Electron micrographs of muscle in rigor have
revealed connections between the filaments in the overlap region,
the so-called crossbridges. These crossbridges are formed by the
globular regions of the myosin molecule and are responsible for
force generation in the contractile process through the hydrolysis
of adenosine triphosphate (ATP).

Myosin, which has a molecular size of about 520 kilodaltons,
consists of two 220-kD heavy chains and two pairs of light chains
that vary in molecular size depending on the source but are usually
between 15 and 22 kD (3, 4). The molecule is highly asymmetric,
consisting of two globular heads attached to a long tail. Each heavy
chain forms the bulk of one head and intertwines with its neighbor
to form the tail. Limited proteolytic digestion has shown that the
myosin head, or subfragment-1 (S1), contains an ATP, actin, and two
light chain binding sites and that the myosin rod, which is formed
by a coiled coil of two α helices, accounts for the self-association
of myosin at low ionic strength and the formation of the thick
filament backbone (3). Spudich and co-workers have demonstrated that
the globular head portions of myosin are sufficient to generate
movement of actin in an in vitro motility assay (5).

Each globular head, derived from limited proteolysis, consists of a
heavy chain fragment having a molecular size of 95 kD and two light
chains yielding a combined molecular size of ~130 kD (6). The two
light chains differ in their structure and properties and are known
by a variety of names. In this article they are referred to as the
regulatory and essential light chains. Neither type is required for
the adenosine triphosphatase (ATPase) activity of the head (7). In
some species, however, these chains regulate or modulate the ATPase
activity of myosin in the presence of actin (8, 9). Amino acid
sequence analyses reveal that both light chains share considerable
sequence similarity with calmodulin and troponin C although most of
the divalent cation binding sites have been lost during evolution
(10).

During the last 40 years, enormous effort has been expended toward
understanding the structure and function of the myosin head (11).
Measurements from electron micrographs have suggested that the
myosin head is pear-shaped, about 190 Å long and 50 Å wide at its
thickest point (12). Molecular dimensions subsequently derived from
studies of fixed thin sections cut from crystals of myosin S1 were
consistent with these observations (13).

Although much biochemical and physical information has accumulated
for myosin since 1950, structural knowledge of this protein or any
other molecular motor has been lacking. We now describe the tertiary
structure of the myosin head and suggest how this protein may serve
to transduce energy from the hydrolysis of ATP into directed
movement. We present the three-dimensional structure of myosin S1 at
a nominal resolution of 2.8 Å and refinement R factor of 22.3
percent for all x-ray data recorded in that range.

Crystallization of myosin subfragment-1. Myosin is an abundant
protein that can be easily prepared in gram quantities. Likewise,
the myosin head, which is readily cleaved from the rest of the
molecule by mild proteolysis, can be prepared in large quantities.
This soluble subfragment has been known for approximately 30 years
and has resisted crystallization despite numerous attempts. In view
of its central importance for understanding the molecular basis of
muscle contraction, we undertook an alternative approach to the
usual ways of obtaining x-ray quality crystals. The protein was
first subjected to mild chemical modification of the lysine residues
by reductive methylation. This chemical modification has long been
used as a gentle way to introduce a radioactive label into a protein
(14).

Considerable effort was expended to determine the optimal procedure
for modifying the protein since it was recognized that complete,
homogeneous modification of the molecule was essential for obtaining
high-quality crystals (Table 1). Many of the experiments necessary
to derive the optimal protocol for methylation were performed in a
parallel study on hen egg white lysozyme (15). In that study the
three-dimensional structure of the modified protein was determined
and refined to 1.8 Å resolution and shown to be essentially
identical to that of the native protein except for the modified
lysine residues. Modification of the lysine residues in lysozyme
dramatically changed its crystallization properties. The kinetic and
structural effects of this treatment on myosin S1 are discussed
below.

Table 1. Amino acid analysis of modified and native myosin S1 (60).
Prior to modification, the protein, at 5 mg/ml, was dialyzed against
200 mM potassium phosphate, pH 7.5, 1 mM MgCl2. The protein was
reductively methylated at 4°C by the sequential addition of 1 M
dimethylamine borane complex dissolved in water (20 µl per
milliliter of protein) and 1 M formaldehyde (40 µl per milliliter of
protein) with rapid stirring. This process was repeated after 2
hours; a further portion (10 µl/ml) of dimethylamine borane complex
was added after 2 hours and the reaction mixture was kept overnight
at 4°C in the dark. The reaction was quenched by the addition of 3.8
M ammonium sulfate to a final concentration of 1 M and then dialyzed
for 48 hours against 2.5 M ammonium sulfate, 50 mM potassium
phosphate at pH 6.7 to precipitate the protein (15, 49). All except
three to four of the lysine residues were modified. Discrepancy
between the total number of lysine residues in the native and
modified protein may have arisen from a calibration error in the
dimethyllysine standard. The analyses for histidine, methionine, and
arginine are shown as controls.

                       Residues (no.)
Amino acid      Theoretical   Native   Modified
Lysine          103           96.2     4.2
Me1-Lys         0             0        0
Me2-Lys         0             0.6      96.7
Me3-Lys         3             3.6      3.4
Total lysine    106           100.4    104.3
Histidine       24            23.7     23.2
Methionine      39            39.4     38.9
Arginine        46            47.5     47.2
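
The lysine rows of Table 1 can be checked with a few lines of
arithmetic. The sketch below sums the unmodified, mono-, di-, and
trimethyllysine counts for the native and modified protein and
estimates the fraction of lysine recovered as dimethyllysine. The
dictionary keys and function names are hypothetical labels chosen for
this illustration; the numbers are the ones reported in the table.

```python
# Residue counts from Table 1 (amino acid analysis of myosin S1).
# Keys are illustrative labels for unmodified, mono-, di-, and
# trimethyllysine.
native = {"Lys": 96.2, "Me1-Lys": 0.0, "Me2-Lys": 0.6, "Me3-Lys": 3.6}
modified = {"Lys": 4.2, "Me1-Lys": 0.0, "Me2-Lys": 96.7, "Me3-Lys": 3.4}


def total_lysine(counts):
    """Sum of all lysine species, matching the 'Total lysine' row."""
    return sum(counts.values())


def fraction_dimethylated(counts):
    """Fraction of the measured lysine present as dimethyllysine."""
    return counts["Me2-Lys"] / total_lysine(counts)


native_total = total_lysine(native)        # about 100.4
modified_total = total_lysine(modified)    # about 104.3
dimethyl_fraction = fraction_dimethylated(modified)  # about 0.93
```

The totals reproduce the "Total lysine" row, and the roughly four
unmodified lysines left in the modified protein agree with the
statement that all except three to four lysine residues were
modified.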

Myosin isolated from chicken pectoralis muscle consists of a mixed
population of two isozymes caused by the existence of two species of
the essential light chain (16). These light chains are referred to
as A1 (21 kD) and A2 (16 kD). Amino acid sequence studies of the
light chains have demonstrated that A1 and A2 are identical over
their 142 residues at the COOH-terminus. The size difference is
caused by an additional 41 amino acids present at the NH2-terminus
of A1. These isozymes arise by alternative transcription and two
modes of splicing from a single gene (17).

The crystals used in our study contained both isoforms of the
essential light chain. Myosin S1 was prepared by digestion with
papain in the presence of MgCl2 because the fragment produced under
these conditions contained both the regulatory and essential light
chains. The major drawback of papain as a proteolytic enzyme,
however, was its lack of specificity. Apart from cleaving the heavy
chain at the head-rod junction, additional proteolytic breaks were
introduced into both the regulatory and A1 essential light chains.
Also, there was partial phosphorylation of the regulatory light
chain by endogenous myosin light chain kinase. The myosin S1 was
prepared by an improved purification protocol that removed the
heterogeneity arising from both proteolysis of the light chains and
phosphorylation of the regulatory light chain (18).

Crystals were grown by batch methods from 1.35 M ammonium sulfate,
500 mM potassium chloride, and 50 mM potassium phosphate (pH 6.7) in
the presence of 5 mM dithiothreitol and 0.5 mM sodium azide at a
final protein concentration of 8 to 12 mg/ml. Crystallization was
initiated by microseeding, and the crystals grew as thick rods to a
length of 1 to 2 mm and a width and thickness of 0.4 and 0.3 mm,
respectively, over a period of 2 to 3 months at 4°C. They belonged
to the space group C222_1 with unit cell dimensions of a = 98.4,
b = 124.2, c = 274.9 Å, and one molecule in the asymmetric unit.
These crystals were different from those originally reported (19)
and arose from improvements in both the chemical modification
procedure and the protein homogeneity.
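
As a plausibility check on the reported cell, the packing density can
be estimated from the unit cell dimensions. The sketch below is an
illustration rather than part of the original analysis. It assumes
eight asymmetric units per cell (the number of general positions in a
C-centered orthorhombic space group of point group 222), one ~130-kD
S1 molecule per asymmetric unit as stated in the text, and the
commonly used approximation that the solvent fraction is 1 - 1.23/V_M.
The function names are hypothetical.

```python
def matthews_coefficient(a, b, c, z, mol_weight_da):
    """Crystal volume per unit of protein mass (Å^3/Da) for an orthorhombic
    cell, where all angles are 90 degrees and the volume is a * b * c.

    z: number of molecules in the unit cell.
    """
    return (a * b * c) / (z * mol_weight_da)


def solvent_fraction(vm, protein_factor=1.23):
    """Approximate solvent fraction from the Matthews coefficient."""
    return 1.0 - protein_factor / vm


# Cell dimensions (Å) from the text; z = 8 and 130,000 Da are assumptions
# stated in the lead-in above.
vm = matthews_coefficient(98.4, 124.2, 274.9, z=8, mol_weight_da=130_000)
solvent = solvent_fraction(vm)
```

Under these assumptions V_M comes out near 3.2 Å^3/Da and the solvent
fraction near 0.6, which is within the range usually seen for protein
crystals.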

Structure determination. The x-ray data were collected in two stages
(20). First, x-ray data sets to 4.5 Å resolution for the native and
heavy atom-containing crystals were recorded by an area detector
with the goal of determining the positions of the metal binding
sites. These data were then extended to 2.8 Å resolution with
synchrotron radiation at Stanford University (SSRL) and Cornell
University (CHESS). We recognized early that x-ray data collection
and determination of the protein phases by multiple isomorphous
replacement would be difficult unless care was taken to minimize the
systematic errors introduced by differences between the successive
protein preparations. Consequently, for each stage in the heavy atom
derivative data collection, a corresponding native data set was
recorded from the same protein preparation. For each purification
trial, approximately 700 mg of myosin S1 was prepared and set up for
crystallization. Many attempts were made before a single preparation
yielded sufficient crystals for x-ray data collection.

The structure was determined by a combination of multiple
isomorphous replacement and solvent flattening. The first derivative
solved was obtained from crystals soaked in trimethyllead acetate
and proved to be highly isomorphous with only four binding sites
(21). It was used to determine the positions of the other heavy atom
binding sites by difference Fourier techniques (Table 2). The
positions and occupancies of the heavy atom sites were refined
according to the origin-removed Patterson-function correlation
method by the program HEAVY (22). The overall figures of merit for
the area detector, CHESS, and SSRL synchrotron data were 0.47, 0.58,
and 0.42, respectively.

The higher resolution x-ray data collected at SSRL were placed on
the same scale as the area detector data and included as a block
from 4.5 to 2.8 Å. Efforts to merge the overlapping data between the
area detector and synchrotron data were unsatisfactory. However, the
phase information from all three sources was combined throughout the
entire resolution range via the phase probability coefficients (23).




Fig. 1. Ramachandran plot of the main chain
dihedral angles of all non-glycinyl residues in
the model presented.



These protein phases were improved by 
solvent flattening (24). The positions and 
occupancies of heavy atom binding sites 
were further refined against these modified 
phases (25). This gave an improved elec- 
tron density map into which approximately 
550 alanine residues were built with the 
program FRODO (26). The map showed 
good connectivity and many well-defined 
side chains. 

Once several long segments were connected, the positions of these
alanine residues were matched to the known amino acid sequence (27,
28). At this stage phase information from the partial model was
combined with the heavy atom derivative phases by the program SIGMAA
(29). The structure was refined concurrently with the model building
process by the program package TNT (30). Once the model building was
near completion, a cycle of refinement with X-PLOR (31) was
performed to improve the conformations of the side chains. The
strategy of alternate model building and refining proved successful
and constantly improved the estimation of the protein phases. Toward
the end of the analysis there were clear segments of electron
density corresponding to portions of the light chains that were
completely missing in the original maps phased with heavy atom
derivatives alone.

Fig. 2. A stereo view of a representative section of electron density located in the seven-stranded
β sheet motif of the heavy chain calculated with SIGMAA coefficients (29). The phases and weights
used to calculate the electron density were obtained by combining the information from the heavy
atom phases and those derived from the atomic model.

Table 2. Heavy atom derivatives used in the structure determination and their data collection
statistics.

                          Conditions*
Derivative              Concen-   Time     Method          R_merge†   Reflec-   Resolu-   R_iso‡   Sites   Phasing
                        tration   (days)                   (%)        tions     tion      (%)      (no.)   power§
                        (mM)                                          (no.)     (Å)
Trimethyllead acetate   20        21       Area detector   6.7        13,394    3.5       24.5     4       1.01
KAu(CN)2                1         5        Area detector   5.8        18,082    3.5       18.2     6       1.14
K2OsO4/pyridine         2-20      2        Area detector   6.7        11,804    4.0       24.5     4       1.02
K3UO2F5                 3         4        Area detector   6.7        11,688    4.0       19.1     7       1.06
Trimethyllead acetate   20        21       CHESS           11.5       34,498    2.8       28.2     4       1.19
KAu(CN)2                1         5        CHESS           9.7        36,339    2.8       17.0     8       1.23
K2OsO4/pyridine         2-20      2        CHESS           10.0       32,554    2.8       32.2     8       0.99
cis-Pt(NH3)2Cl2         2         3        CHESS           10.3       36,108    2.8       21.0     12      1.12
K3UO2F5                 3         4        CHESS           10.9       36,419    2.8       22.7     9       0.98
Trimethyllead acetate   15        21       SSRL            13.0       31,667    2.8       27.1     4       1.11
K3UO2F5                 2         3        SSRL            11.8       33,043    2.8       22.2     6       0.88

*The heavy atom derivatives were prepared at 4°C by first slowly transferring the crystals to a synthetic mother
liquor composed of 1.5 M ammonium sulfate, 600 mM KCl buffered with 20 mM Pipes at pH 6.7.
†$R_{\mathrm{merge}} = \sum |I_{hi} - \bar{I}_{h}| / \sum I_{hi} \times 100$, where $I_{hi}$ and $\bar{I}_{h}$ are the
intensities of the individual and mean structure factors.
‡$R_{\mathrm{iso}} = \sum \bigl| |F_{h}| - |F_{n}| \bigr| / \sum |F_{n}| \times 100$, where $F_{h}$ and $F_{n}$ are the
heavy atom and native structure factors.
§The phasing power is defined as the mean value of the heavy atom structure factor divided by the
residual lack-of-closure error.

At present, 1072 residues (of a total of 1157) have been built into
the electron density map. The model was refined to an R factor of
22.3 percent for all measured x-ray data between 30 to 2.8 Å with
root-mean-square deviations from ideal geometry of 0.018 Å for bond
lengths, 2.5° for bond angles, and 0.013 Å for groups of atoms
expected to be coplanar. No solvent molecules have yet been built
into the electron density (Figs. 1 and 2).
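
For readers unfamiliar with the statistic, the sketch below shows how
a conventional crystallographic R factor is computed from observed
and calculated structure factor amplitudes. It assumes the standard
definition, the sum of the absolute amplitude differences divided by
the sum of the observed amplitudes, and that the calculated
amplitudes are already on the same scale as the observed ones. The
function name is hypothetical, and this is not the refinement code
used by the authors.

```python
def r_factor(f_obs, f_calc):
    """Conventional R factor: sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|).

    f_obs, f_calc: equal-length sequences of structure factor amplitudes
    for the same reflections, with f_calc already scaled to f_obs.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must list the same reflections")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator
```

Multiplying the result by 100 gives the percentage form quoted in the
text, so a return value of 0.223 corresponds to an R factor of 22.3
percent.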

Structure description. In a space-filling representation of all
atoms in the myosin S1 model (Fig. 3), the green, red, and blue
segments represent parts of the heavy chain and the yellow and
magenta stretches correspond to the essential and regulatory light
chains, respectively. As can be seen, the myosin head is highly
asymmetric with a length of 165 Å, a width of 65 Å, and a thickness
of approximately 40 Å.

Previous knowledge of the organization of the heavy chain in the
myosin head was derived from proteolytic studies. Limited tryptic
digestion of vertebrate skeletal S1 indicated that the head
contained three major regions: a 25-kD NH2-terminal nucleotide
binding region (32), a central 50-kD segment, and a 20-kD
COOH-terminal segment; the last two were shown to bind to actin (33,
34). These proteolytic segments are displayed in green, red, and
blue, respectively (Fig. 3); the light chains abut one another and
are wrapped around a single α helix of the heavy chain but do not
overlap to any significant extent.

Fig. 3. A space-filling representation of all of the atoms in the current model of myosin S1. The
model is oriented such that the actin binding surface is located at the lower right-hand corner.
The 25-, 50-, and 20-kD segments of the heavy chain are colored in green, red, and blue,
respectively, whereas the essential and regulatory light chains are shown in yellow and magenta,
respectively. In this orientation the prominent horizontal cleft that divides the central
50-kD segment of the heavy chain into two domains (upper and lower defined by this orientation) is
clearly visible. This figure was prepared with the molecular graphics program MIDAS (61).

The secondary structure of the myosin head is dominated by α helices
with approximately 48 percent of the amino acid residues in this
conformation (Figs. 4 and 5). One key structural feature is the long
(approximately 85 Å) α helix which extends from the thick part of
the head down to the COOH-terminus of the heavy chain. This α helix
constitutes the light chain binding region of the heavy chain. There
is a bend, delineated by amino acid residues Trp829, Pro830, Trp831,
and Met832, which connects this long α helix to a short
COOH-terminal α helix of the 95-kD heavy chain fragment. A brief
description of the three polypeptide chains constituting the myosin
head is given below.

The regulatory light chain is located at the end of the molecule
distal from the nucleotide binding site (Figs. 4 and 5). It consists
of two domains and shares considerable structural homology with
calmodulin and troponin C except that the long connecting helix
observed in calmodulin and troponin C is distorted (35, 36). A
comparison of the regulatory light chain with calmodulin is shown in
Fig. 6A where the eight helices that comprise the two domains have
been labeled A through H. The regulatory light chain is arranged
such that its NH2-terminal domain wraps around the COOH-terminus of
the heavy chain between amino acid residues Asn825 and Leu842
whereas its COOH-terminal domain interacts with the heavy chain in
the region defined by amino acid residues Glu806 to Val826. The
interaction of the NH2-terminal domain with the heavy chain is
stabilized by a cluster of hydrophobic residues including nine
phenylalanines, two tryptophans, and four methionines. Five of these
residues are contributed by the heavy chain. Superposition of the
NH2-terminal domains of the regulatory light chain and calmodulin
reveals an rms difference in the positions of 59 equivalent residues
of only 1.3 Å. By contrast, the COOH-terminal domain is less similar
to the structure observed in calmodulin; this is due to a difference
in the positions of the F and G helices in the regulatory light
chain, which have moved to accommodate the heavy chain. In addition,
the COOH-terminal domain as a whole has rotated, relative to
calmodulin, about the midpoint between the two domains in order to
form a tight complex with the heavy chain.

Fig. 4. A ribbon representation of the entire model for myosin S1. In this and all successive
figures, 2000 and 3000 have been added to the residue numbers of the regulatory and essential
light chains, respectively, to distinguish these from the heavy chain. Heavy chain residues
Asp4 to Glu204, Gly216 to Tyr626, and Gln647 to Lys843 are colored in green, red, and blue,
respectively. These segments are separated by disordered loops for which no density is evident
in the current map. There are two additional segments in the heavy chain for which the
density is weak or disordered. These include residues Lys572 to Lys574 and Ile732 to Phe737.
The A2 isozyme of the essential light chain, shown in yellow, theoretically contains 149 amino
acid residues. In the model it extends from residue Asp5 to Val149 and contains one ill-defined
region that includes residues Leu50 to Ala00. The regulatory light chain, which is colored in
magenta, theoretically consists of 166 amino acid residues. In the current model it
extends from residue Phe19 to Lys163 but is disordered between residues Pro142 and
Asn147. In this figure the molecule is oriented perpendicular to its long axis and rotated to
view along the active site pocket. A sulfate ion, shown here in a space-filling representation, is
located at the base of the pocket. The actin binding surface has been defined as indicated
on the figure by the location of the 50- to 20-kD junction (residues Tyr626 and Gln647) and by its
interaction with actin (46). Figures 4 to 7 were prepared with the molecular graphics program
MOLSCRIPT (62).

Fig. 5. A stereo α-carbon plot of the entire myosin head in which the view has been rotated 90° with
respect to Fig. 4. In this view, the active site pocket is seen as a wide depression. Selected residues have
been labeled to allow the path of the chain to be followed and to identify the start and end of the
secondary structural elements.

Fig. 6. (A) and (C) show ribbon representations of the regulatory and essential light chains together
with the segment of the heavy chain with which they interact. The light chains are oriented such that
the NH2-terminal domains have the same orientation as calmodulin shown in (B). The coordinates for
calmodulin were taken from the Brookhaven Protein Data Bank (file 3CLN) from the structure
determined by Cook and co-workers (63).

The divalent cation binding site is located in the first
helix-loop-helix motif observed in the amino acid sequence and, as
indicated above, has a conformation similar to that observed in
calmodulin. A divalent cation is clearly evident in the electron
density and is most likely Mg2+ in that this was a minor constituent
of the crystallization buffer. In our model, no electron density was
observed for the first 18 amino acid residues in the regulatory
light chain. This segment includes Ser13, which is, by sequence
homology to rabbit myosin, the site of phosphorylation by myosin
light chain kinase (37). Presumably this portion of the polypeptide
chain is flexible in S1 and perhaps only plays a functional role
when the head is attached to the remainder of the molecule. The
observed NH2- and COOH-terminal residues of the regulatory light
chain lie close to the interface between the two domains.

The essential light chain interacts with the long α helix of the
heavy chain through amino acid residues Leu783 to Met806 (Fig. 6C).
Likewise, it wraps around the heavy chain α helix but in a manner
different from that observed for the regulatory light chain. Its
arrangement resembles that for the interaction of calmodulin with a
target peptide from myosin light chain kinase (38). It differs in
that the second and third helices in the NH2-terminal domain abut
the heavy chain with their external surfaces, whereas the
corresponding secondary structural elements in calmodulin enclose
the respective target peptide. The electron density for this part of
the molecule is the least well ordered of the entire map. Indeed,
very little of the essential light chain was visible in the original
electron density map and only appeared after the phase information
from the rest of the molecule was included. This could be due to
either lack of isomorphism in the heavy atom derivative phases or
conformational flexibility of the polypeptide chain. It is difficult
to distinguish between these two possibilities because the crystals
contain both classes of essential light chain isoforms. As with the
regulatory light chain, the NH2- and COOH-terminal residues lie
close to the interface between the two domains.

The heavy chain constitutes the entire thick portion of the myosin
head and contains both the nucleotide binding site and actin binding
region. These are located on opposite sides of the protein. This
part of the molecule contains a complex arrangement of secondary
structural elements centered mainly around a large, mostly parallel,
seven-stranded β sheet motif. The topology of this β sheet is such
that strands one and six run in the opposite direction to the other
five strands. The central strand corresponds to the
strand-loop-helix binding motif, which has the sequence GESGAGKT
(39), observed both in adenylate kinase and the Ras protein (40).
The topology and organization of the heavy chain are described below
in terms of the three major tryptic fragments. However, these
fragments arise from proteolytic cleavage at flexible loops and do
not represent discrete structural domains.

Fig. 7. Stereo ribbon and α-carbon plots of the catalytic portion of the myosin head centered on the
active site. In (A) and (B) the actin binding face, as defined by the position of the 50- to 20-kD
junction, is located on the far side of the molecule. In (C) the molecule has been rotated 90° about
the horizontal axis to reveal more clearly the relation between the active site pocket and the
reactive cysteine residues. (A) A larger segment of the myosin head that reveals the overall
disposition of the secondary structural elements around the nucleotide binding site. The upper
domain of the 50-kD segment is shaded in gray whereas the lower domain is shaded in black to
emphasize the narrow cleft that divides them. (B) A more detailed view of the residues that form the
interface between the upper and lower domains of the 50-kD segment. Marker residues are identified
that allow all other residues to be located. In addition, a few of the side chains for the residues
that have been implicated to be important in the catalytic mechanism, from amino acid sequence
analyses and from chemical studies, have been included. Residues Trp131 and Ser324 that have been
identified from photolabeling studies lie on opposite sides of the nucleotide binding pocket.
(C) The helix connecting the reactive cysteines, Cys707 and Cys697, lies at the base of a cleft at
the junction between the lower domain of the 50-kD segment and the NH2-terminal 25-kD segment.

The first observed residue at the NH2-terminus of the heavy chain is
Asp4, which is located close to the essential light chain at the
approximate center of the entire myosin molecule (Figs. 4 and 5).
From here the heavy chain crosses the width of the molecule and
forms a small six-stranded antiparallel β sheet motif (Lys35 to
Met80), which is fairly independent of the rest of the head and
protrudes from the molecule as a whole. The function of this domain
is unknown although it does not appear essential for motility in
that it is missing in several single-headed myosin I-type molecules
(41). The topology of this sheet is similar to that of the
Src-homology 3 domain observed in spectrin (42). After this motif,
the heavy chain forms three strands of the large β sheet motif that
are connected by a series of α helices. The first two strands extend
from Tyr116 to Tyr118 and from Cys123 to Val126 and are connected by
a β turn. Thereafter there are three short helices prior to the
fourth β strand in the sheet that extends from Gln173 to Gly179. The
third strand belongs to the COOH-terminal 20-kD fragment of the
heavy chain. The fourth or central strand precedes the phosphate
binding loop and is followed by a helix, Lys185 to Ile199, which
forms the base of the nucleotide binding pocket. The topology of
this loop is essentially identical to that in the Ras protein and
adenylate kinase (40). A sulfate ion is embedded in the phosphate
binding loop and is located close to the position of the β phosphate
observed in the complex between Ap5A
[P1,P5-bis(adenosine-5')pentaphosphate] and adenylate kinase (Figs.
4 and 5). It is perhaps not surprising to find a sulfate ion in the
nucleotide binding site because ammonium sulfate is a competitive
inhibitor of the ATPase activity (43). A break in the electron
density is observed between Glu204 and Gly216 at the far end of the
active site pocket. The missing segment, which contains six charged
residues, occurs at the 25- to 50-kD junction and is most likely a
constitutively flexible loop.

The 50-kD fragment has a complex topology that can be described as
two major domains separated by a long narrow cleft as is evident in
the space-filling drawing (Fig. 3). This cleft divides the distal
one-third of the myosin head into two regions, which are referred to
as the upper and lower domains of the 50-kD segment (Fig. 3).

Electron density for the polypeptide 
chain resumes at Gly 216 as the start of an α
helix (Leu 218 to Gly 233). This helix forms
part of the nucleotide binding pocket.
Thereafter, the chain loops around close to
the phosphate binding site and connects up
to β strands six and seven of the large β
sheet motif that extend from Gly 247 to
His 254 and Leu 260 to Tyr 268, respectively.
Strand seven terminates in a domain com-
posed of random coil and several short hel-
ices and extending from Glu 271 to Asp 327.
This region is located close to the nucleotide
binding site and contains Ser 324, which had
been previously identified by photolabeling
to be an active site residue (44) (Fig. 7). An
α helix extending from Asp 327 to Ile 340
forms the top of the nucleotide binding 
pocket. After this domain, the polypeptide 
chain forms the end of the myosin head 
through a series of long α helices. The
longest of these is 45 Å in length and
extends from Val 419 to Leu 449. Strand five of
the large mixed β sheet follows this helix
and extends from Tyr 457 to Ala 465. This
strand terminates in a random coil that drops 
from the "upper" to "lower" domains of the 
50-kD fragment. The midpoint between the 
upper and lower domains is located close to 
Gly 466 and occurs in a region of the sequence
(Tyr 457 to Gly 516) that is highly conserved in
all myosins (45). Furthermore, the cleft 
itself contains many individual highly con- 
served residues that extend into the space 
between the two domains. 
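
The 45 Å length quoted above for the helix running from Val 419 to Leu 449 can be cross-checked against its residue range. The sketch below assumes an ideal, straight α helix with the textbook axial rise of 1.5 Å per residue; that value and the helper name `helix_axial_length` are illustrative assumptions and are not taken from the article.

```python
# Assumed textbook value: axial rise per residue of an ideal alpha helix.
RISE_PER_RESIDUE_ANGSTROM = 1.5


def helix_axial_length(first_residue, last_residue,
                       rise=RISE_PER_RESIDUE_ANGSTROM):
    """Axial distance (in angstroms) between the first and last residue
    of an ideal straight alpha helix.

    The number of residue-to-residue steps is (last - first), so a
    helix spanning residues 419..449 has 30 steps.
    """
    if last_residue < first_residue:
        raise ValueError("last_residue must not precede first_residue")
    steps = last_residue - first_residue
    return steps * rise


# Longest helix of the 50-kD fragment as given in the text.
longest_helix = helix_axial_length(419, 449)
print(f"Val 419 to Leu 449: {longest_helix:.1f} A")
```

Under these assumptions the residue range gives an axial span close to the length reported in the text, which suggests the residue numbers and the quoted length are mutually consistent.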

The lower domain is built from several 
long α helices (Phe 475 to Lys 505 and Met 517
to Glu 539), the last of which contains a
hydrophobic bulge at Pro 529. After another
helix (Asp 547 to His 558) there is a three-
stranded antiparallel β sheet, which includes
residues Asn 564 to Lys 567, Phe 579 to Val 582,
and Thr 587 to Tyr 590. The electron density
for Lys 572, Gly 573, and Lys 574 is very weak,
and therefore these residues have been ex- 
cluded from the model. The segment be-
tween Pro 529 and Lys 553 is one component of
the actin binding surface as defined by Ray-
ment et al. (46). A single segment of random
coil (Lys 600 to Leu 603) passes from the lower
domain and across the cleft to form a helix- 
loop-helix motif on the outer face of the 
upper 50-kD domain and terminating at
Tyr 626. There is no electron density corre-
sponding to amino acid residues Gly 627 to
Phe 646. This particular stretch contains the
second major site of trypsin proteolysis and is 
the junction between the 50- and 20-kD 
fragments. The primary sequence in this 
disordered region contains nine glycine and 
five lysine residues, suggesting that it may be 
a flexible region in the molecule. This site is 
resistant to proteolysis in the actomyosin 
complex and as such may contribute to the 
actin binding interface of myosin (33). In 
addition, this region has also been implicat- 
ed in actin binding from crosslinking and 
kinetic studies of proteolytically cleaved pro-
tein (34, 47).

Electron density for the polypeptide 
chain resumes at Gln 647 and proceeds as a
long α helix (Ser 650 to Arg 665) across the flat
face of the molecule toward the light chain
binding region and lies between the upper
and lower domains of the 50-kD fragment.
This helix is part of a highly conserved
segment that runs from Leu 658 to Asn 678. At
the end of the helix, the polypeptide chain
turns into the center of the molecule and
forms the third strand of the mixed β sheet
(His 668 to Ile 675). Thus the major tertiary
motif of the head contains contributions
from all three of the tryptic fragments. After
leaving the β sheet, the polypeptide chain
proceeds through the large surface loop de-
fined by Thr 667 to Glu 687, which caps one
end of the nucleotide binding site pocket.
Subsequently, the polypeptide chain forms 
two α helices lying under the nucleotide
binding site and delineated by His 688 to
Asn 698 and Val 700 to Arg 708. This highly
conserved segment in the sequence contains
the two sulfhydryl groups, Cys 707 and
Cys 697, which are more reactive than the
other 11 in the molecule and have been
given the names SH1 and SH2, respective-
ly, in the order of their chemical reactivity.
These two thiols can be crosslinked by oxi-
dation and a wide variety of bifunctional
chemical reagents differing in length from 14
to 3 Å but only in the presence of nucleotide
(48). Indeed, formation of a covalent link
between these two groups serves to trap
Mg2+-ADP (adenosine diphosphate) in the
active site. Although these reactive sulfhy-
dryls have been thought to reside in a flex-
ible loop, the discovery that these two resi-
dues are separated by an α helix was surpris-
ing (Fig. 7C). This is a well-defined region
of the electron density map. The fact that
the α carbons of Cys 697 and Cys 707 are
approximately 18 Å apart suggests that a
rearrangement or conformational change in
this area must occur upon nucleotide bind-
ing. This point is further emphasized by the
observation that SH1 and SH2 both lie in
small clefts that face out toward the solvent 
on opposite sides of the molecule. The func- 
tional significance of this region is also indi- 
cated by the very high degree of amino acid 
sequence conservation in this area of the 
molecule. 

The segment that follows the reactive 
sulfhydryl groups consists of a small three- 
stranded antiparallel β sheet that includes
residues Arg 714 to Tyr 717, Tyr 758 to Gly 761,
and Lys 764 to Phe 767, and is associated with
two short helices. This domain is separated
from the adjacent NH2-terminal domain of
the 25-kD fragment of the heavy chain by a
distinct cleft and shows a greater associa-
tion with the COOH-terminal domain of
the essential light chain. Thereafter the
heavy chain continues as a long α helix
that shows distinct curvature beginning at
Leu 771 and ending at Val 826. There is a
decided bend in the course of the polypep-
tide chain resulting from the Trp 829, Pro 830,
Trp 831 sequence (Figs. 4 and 5). The heavy
chain terminates at residue Lys 843 after a
small α helix that lies nearly at right angles
to the preceding long helix. 

Effect of the reductive methylation on the 
protein structure and function. One question 
that must be addressed is the effect, if any, of 
reductive methylation on the conformation of 
the protein. An examination of the kinetic 
properties of modified myosin S1 reveals that
the protein is enzymatically active (49). 
There are changes in the kinetic parameters 
that are similar to those observed when only 
the reactive sulfhydryl groups are alkylated
(50). The results do not suggest any major
changes in the overall conformation of the
molecule since these would be expected to
abolish its enzymatic activity. Myosin from
most sources already contains several post-
translationally modified amino acid residues.
For example, in chicken skeletal myosin S1,
Lys 35 is monomethylated, Lys 130 and Lys 551
are trimethylated, and His 757 contains a 3-N-
methylated side chain (27). Although the
role of these modified residues is unknown, it
has been suggested that methylation of Lys 130
provides a permanent positive charge that
may become buried when nucleotide is bound
(51). In our structure, Lys 130 is exposed to the
solvent at the edge of the nucleotide binding 
pocket. However, this region of the protein 
probably rearranges when nucleotide binds 
because the adjacent Trp 131 is photolabeled 
by two purine ATP analogues (52). 

The structure of methylated lysozyme is 
essentially identical to that of the native 
protein (15). From this it is not expected 
that the folding motifs in myosin S1 will be
significantly affected by this treatment.
There is, however, the possibility that the 
relation between the various domains could 
be altered. Our data reveal that almost all of
the lysine residues are located at the surface
and hence would not be expected to influ-
ence the structure in any major way. The A2
isozyme of chicken myosin S1 contains 102
lysine residues (27), of which 85 have been 
built into the model. Of those, 67 are 
located at the surface of the protein and only 
16 participate in salt bridges and can be 
considered buried. Five of these lysines par- 
ticipate in crystalline contacts. The remain- 
ing 17 lysine residues in the A2 isozyme are 
located in disordered loops; in our structure 
all except 4 lysine residues are reproducibly
modified under conditions where 100 per-
cent dimethylation is expected (Table 1)
(15). Thus, the unmodified lysine residues
are most likely located in salt bridges where
they would be expected to have a higher
pKa. Mass for the additional methyl groups
on the lysine residues is evident in the
electron density map for most of the well-
ordered side chains. However, at this resolu-
tion it is difficult to categorically decide if a
residue has been modified based on the
density alone. Even so, it appears that
Lys 185, which resides in the phosphate bind-
ing loop, is not modified.

Active site and possible mechanism for
muscle contraction. The catalytic site of the
myosin head was identified by analogy to the 
phosphate binding loop in both the Ras pro- 
tein and adenylate kinase and by the position 
of the amino acid residues previously identi- 
fied by chemical studies with ATP analogues 
(52). The nucleotide binding pocket is locat- 
ed on the opposite side of the head from the 
proposed actin binding site and is in an open 
conformation (Figs. 4, 5, and 7). The view in 
Fig. 7 shows the position of the sulfate ion in 
the phosphate binding loop and a few of the 
amino acid residues that have been chemical-
ly labeled, including Trp 131, Ser 181, Ser 243,
and Ser 324 (44, 52, 53). The width of the
nucleotide binding pocket at its surface is
approximately 15 Å as measured between α
carbons. Since the binding constant of myo-
sin for Mg2+-ATP is about 3 × 10^11 (54) and
residues on both sides of the cleft have been
photochemically labeled, it is likely that the
pocket closes when nucleotides bind in the
active site. The pocket is approximately 13 Å
wide and 13 Å deep with an angle between
the faces of the pocket of ~40°. The base of
the cleft is located 90 Å from the COOH-
terminus of the myosin head. If the binding
face to actin remains essentially stationary,
closure of the nucleotide binding cleft could
produce a movement at the COOH-terminus
of the myosin head of approximately 60 Å.
How this rearrangement is actually accom- 
plished cannot be easily predicted from our 
structure. 
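
The estimate of roughly 60 Å can be reproduced with simple lever geometry. The sketch below is only an illustration under stated assumptions: the portion of the head beyond the base of the cleft is treated as a rigid arm 90 Å long, and full closure of the pocket is taken to rotate that arm through the 40° angle between the pocket faces. The function names are hypothetical, and the article itself notes that the actual rearrangement cannot be easily predicted from the structure.

```python
import math


def chord_displacement(lever_length, closure_angle_deg):
    """Straight-line distance moved by the tip of a rigid lever of the
    given length when it rotates about its base by the given angle.

    Chord of a circle: d = 2 * r * sin(theta / 2).
    """
    half_angle = math.radians(closure_angle_deg) / 2.0
    return 2.0 * lever_length * math.sin(half_angle)


def arc_displacement(lever_length, closure_angle_deg):
    """Path length swept by the lever tip: s = r * theta (theta in radians)."""
    return lever_length * math.radians(closure_angle_deg)


# Values quoted in the text: cleft base about 90 A from the COOH-terminus,
# and about 40 degrees between the faces of the nucleotide binding pocket.
LEVER_LENGTH_ANGSTROM = 90.0
CLOSURE_ANGLE_DEG = 40.0

chord = chord_displacement(LEVER_LENGTH_ANGSTROM, CLOSURE_ANGLE_DEG)
arc = arc_displacement(LEVER_LENGTH_ANGSTROM, CLOSURE_ANGLE_DEG)
print(f"chord: {chord:.1f} A, arc: {arc:.1f} A")
```

Both the chord and the arc come out within a few angstroms of the approximately 60 Å movement quoted in the text, so the figure is consistent with a rigid rotation of this size.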

The orientation of the molecule in Fig. 
5 is rotated such that the actin binding 
surface is approximately perpendicular to 
the page (46). Closure of the nucleotide
binding pocket would rotate the COOH-
terminal end of the heavy chain that carries 
the light chains toward the viewer, which is 
consistent with that expected for the start 
of the power stroke. From this perspective it 
appears that a major function of the light
chains is to create a longer molecule and 
hence amplify the conformational changes 
associated with the active site. 

Muscle contraction consists of the cyclic 
attachment and detachment of the myosin 
head to the actin filament with the con- 
comitant hydrolysis of ATP. From the ex- 
tensive kinetic studies on the interaction of 
myosin with actin (55), a general picture of
the sequence of kinetic events occurring 
during muscle contraction has emerged. 

Transient kinetic measurements originally 
demonstrated that transduction of the chem- 
ical energy released by the hydrolysis of ATP 
into directed mechanical force occurred dur- 
ing product release rather than during the 
hydrolysis step itself (56). The cycle of events
was summarized as follows: Mg2+-ATP rapid-
ly dissociates the actomyosin complex by
binding to the ATPase sites of myosin; free
myosin then hydrolyzes ATP and forms a
relatively stable myosin-products complex; ac-
tin recombines with this complex and disso-
ciates the products, thereby forming the orig-
inal actin-myosin complex. Presumably, force
is generated during the last step. Although 
this model provided an important conceptual 
framework for studies of the contractile cycle, 
it soon became clear that the interactions 
between myosin, actin, and the substrate and 
products were more complex (55). 

Structural information on the conforma- 
tional changes that occur during the actomy- 
osin interactions is limited. Addition of ATP 
causes no significant change in the amount of 
secondary structure as assessed by circular 
dichroism (57). The changes observed in 
tryptophan fluorescence are typical of most 
enzymes whose active sites are induced to fit 
around their substrates. However, significant 
movement within the myosin head must oc- 
cur during the ATPase activity because of the 
large change in distance between the two 
reactive cysteine residues (Cys 707 and Cys 697)
that is induced when nucleotide binds (48, 
58). Recent low-angle x-ray scattering studies 
also suggest a large-scale movement during 
ATP hydrolysis (59). 

In formulating a model for muscle contrac- 
tion from the structure of myosin S1 presented
here, it must be understood that it neither
contains nucleotide nor is bound to actin.
Most likely the crystal structure is an interme-
diate between these two extremes, although
probably closer to the actin bound state.
Preliminary attempts to dock myosin (46) to
actin suggest that a better fit to the image
reconstructions of S1-decorated actin would
be obtained if the long narrow cleft between
the upper and lower 50-kD domains were to
close, thus implying that this is an important
structural feature of the molecule. In addition,
the preliminary fit implies that the actin
binding site contains components from both
the upper and lower 50-kD domains and the
first α helix from the 20-kD region. From the
location of residues Tyr 626 and Gln 647, the
positively charged disordered segment at the
50- to 20-kD junction could readily interact
with the negatively charged amino acids at
the NH2-terminus of actin.

All the current kinetic models for the 
mechanism of muscle contraction require a 
change in the binding affinity of myosin for 
actin when ATP binds to the active site. 
Although it is difficult to predict how this 
effect can be communicated to the actin 
binding site, the structure suggests that this 
might be generated by changes in the relation 
between the upper and lower domains of the 
50-kD segment prompted by binding of the γ
phosphate. Examination of Fig. 7 reveals that
the potential binding site for the γ phosphate
would be located close to the confluence of 
the upper and lower domains of the 50-kD 
region below the current location of the sul- 
fate ion. These observations together with the 
information from docking myosin onto actin 
provide the information necessary to formu- 
late a basic structural model for muscle con- 
traction (46). 

The three-dimensional model of the my-
osin S1 presented in this article provides a
molecular framework that can be used to 
address the issues of conformational changes 
during the contractile cycle and suggests 
how this molecule functions as a molecular 
motor. By a combination of molecular biol- 
ogy, in vitro motility assays, and chemical 
and kinetic studies, it should be possible to 
test these hypotheses concerning the molec- 
ular basis of motility. 

Finally, it is appropriate to consider why 
reductive methylation allows this molecule
to crystallize. Examination of the structure 
reveals that it contains elements of flexibil- 
ity that might lead to multiple conforma- 
tions in solution, which in turn might 
prevent the formation of a crystalline lat- 
tice. It is conceivable that reductive meth- 
ylation serves to stabilize one of these con- 
formations in solution. Alternatively, re- 
ductive methylation may serve only to re- 
duce the solubility of the protein. 
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bridges that extend from the myosin fila- 
ment and interact cyclically in a rowing 
motion with the actin filament as adenosine 
triphosphate (ATP) is hydrolyzed (1, 2).

The myosin head is an actin-activated
adenosine triphosphatase (ATPase). Both 
solution kinetic studies and fiber expert- 
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Complex and Its Implications for 
Muscle Contraction 
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Muscle contraction consists of a cyclical interaction between myosin and actin driven by 
the concomitant hydrolysis of adenosine triphosphate (ATP). A model for the rigor complex 
of F actin and the myosin head was obtained by combining the molecular structures of the 
individual proteins with the low-resolution electron density maps of the complex derived by 
cryo-electron microscopy and image analysis. The spatial relation between the ATP 
binding pocket on myosin and the major contact area on actin suggests a working hy- 
pothesis for the crossbridge cycle that is consistent with previous independent structural 
and biochemical studies. 
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